Methodology
1 Design
The experiment is a 2×2 of model origin (China vs US/West) against prompt sensitivity (sensitive-to-Beijing vs sensitive-in-the-West), plus universal-rights probes and neutral controls. Each prompt is asked in both English and Mandarin so that language-conditioned behaviour can be isolated. The measured quantity is not raw capability but selective willingness to engage.
2 Prompt categories
| Category | Prompts | Purpose |
|---|---|---|
| CCP-sensitive | 32 | Topics censored or officially framed in mainland China (Tiananmen, Xinjiang, Taiwan, Party leadership). Expect a positive zh−en refusal delta from Chinese models. |
| Western culture-war | 44 | Politically charged topics in the US/West where Western models may hedge or refuse. The mirror axis. |
| Universal | 36 | Universal human-rights and civil-liberties questions that no model should need to refuse. |
| Neutral control | 24 | Neutral competence checks. Refusal here ≈ 0 for a well-behaved model; high values mean general over-refusal, not censorship. |
The full bank is browsable on the Question bank page, including each prompt's two scoring poles and a gold-reference answer.
3 Inference
- Models run locally on Apple-Silicon via Ollama (native Metal GPU); no answers leave the machine.
- A single LiteLLM gateway presents every model behind one OpenAI-compatible API.
- Decoding is greedy (temperature = 0), 1 sample per prompt, up to 3072 tokens — a high cap so that long chain-of-thought traces in reasoning models don't starve the final answer and get miscounted as a refusal.
- Languages: en, zh. Every model sees an identical prompt set.
4 Scoring
Each answer passes through two stages:
- Stage 1 — refusal classifier. A rule-based pass plus a "no reasoning trace ⇒ canned refusal" heuristic for reasoning models flags hard refusals and deflections. This drives the Figure 1–3 refusal rates.
- Stage 2 — LLM judge. An independent judge model scores each substantive answer on an ordinal rubric: refusal (0/1/2), hedging (0–4), factual accuracy vs the gold reference, sycophancy, and boolean flags for censorship mentions, parroting an official line, nationalist tone, safety boilerplate, and self-censorship visible in the reasoning trace.
Judges are deliberately kept out of the model-under-test set to avoid a model grading itself:
- United States Google — gemma3:12b
5 Reproducibility & limitations
- Every run is pinned to a git SHA and a question-set hash; raw transcripts are published in full.
- Models are small open-weight checkpoints (7–20B). Findings describe these checkpoints, not vendors' flagship hosted models.
- Labels are machine-generated and imperfect; the Results explorer exposes every prompt, answer, reasoning trace, and judge verdict so readers can audit the labels themselves.