Political Censorship and Refusal Asymmetry in US and Chinese Open-Weight Language Models
Abstract
We probe 5 open-weight chat models — 2 of Chinese origin and 3 of US/Western origin — with a parallel bank of 34 prompts spanning four categories: politically sensitive topics for the Chinese state, Western culture-war flashpoints, universal-rights questions, and neutral controls. Every prompt is posed in both English and Mandarin at temperature 0. Each of the 340 resulting answers is scored for refusal, deflection, hedging, and adherence to an official line by an independent LLM judge (Google gemma3:12b) plus a rule-based classifier. The central question: does each model's willingness to engage depend on whose politics the question touches? All prompts, raw transcripts, and code are published for replication.
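The size of the experiment follows directly from the design: 5 models, 34 prompts, 2 languages. A minimal sketch of how that run grid could be enumerated is shown here; the prompt IDs and the record fields are hypothetical placeholders, not the published harness.

```python
from itertools import product

# Model names as they appear in the roster table of this report.
MODELS = ["deepseek-r1-14b", "qwen3-8b", "gptoss-20b", "llama31-8b", "phi4-14b"]
LANGUAGES = ["en", "zh"]  # English and Mandarin
N_PROMPTS = 34

# Hypothetical prompt identifiers; the real bank uses its own naming.
prompt_ids = [f"p{i:02d}" for i in range(1, N_PROMPTS + 1)]

# One record per (model, prompt, language) combination, all at temperature 0.
runs = [
    {"model": model, "prompt_id": pid, "lang": lang, "temperature": 0}
    for model, pid, lang in product(MODELS, prompt_ids, LANGUAGES)
]

print(len(runs))  # 5 * 34 * 2 = 340
```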
1 Headline result
The diagnostic is not "does a model refuse" but which topics it refuses. A model that censors politically sensitive content selectively will show a high refusal rate in one column and a low one in another. The four numbers below summarise that contrast.
The asymmetry of interest: each origin's refusal rate is expected to spike on the prompts that are politically sensitive in its own jurisdiction. Controls (neutral questions) should sit near zero for everyone — a model that refuses those is merely over-cautious, not censoring.
2 Refusal by model and category
3 Cross-lingual censorship gap
If a model is more reluctant to discuss a sensitive topic in Mandarin than in English, that language-conditioned gap is a strong fingerprint of trained-in censorship. Figure 2 plots, per model, the refusal rate on CCP-sensitive prompts in Chinese minus the rate in English.
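The gap described above is a difference of two refusal rates. The sketch below shows one way to compute it from per-answer records; the record layout (`model`, `category`, `lang`, `refused`) and the category label `ccp_sensitive` are assumptions for illustration, and the toy records are made up rather than taken from the transcripts.

```python
def cross_lingual_gap(records, model, category="ccp_sensitive"):
    """Refusal rate in Mandarin minus refusal rate in English, in percentage points."""
    rates = {}
    for lang in ("zh", "en"):
        subset = [
            r for r in records
            if r["model"] == model and r["category"] == category and r["lang"] == lang
        ]
        if not subset:
            raise ValueError(f"no {lang} records for {model} / {category}")
        rates[lang] = sum(r["refused"] for r in subset) / len(subset)
    return 100 * (rates["zh"] - rates["en"])


# Toy data for a hypothetical model: 3 of 4 Mandarin answers refused, 1 of 4 English.
toy = (
    [{"model": "m", "category": "ccp_sensitive", "lang": "zh", "refused": x}
     for x in (True, True, True, False)]
    + [{"model": "m", "category": "ccp_sensitive", "lang": "en", "refused": x}
       for x in (True, False, False, False)]
)

print(cross_lingual_gap(toy, "m"))  # 50.0
```

A positive value means the model refuses more often in Mandarin; the result is a difference in percentage points, not a relative change.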
4 Where the refusals cluster
Refusal rate on CCP-sensitive prompts, broken down by sub-topic (rows) and model (columns). Darker = more refusal. This localises censorship to specific subjects — typically Tiananmen, Xinjiang, Taiwan, and Party leadership.
| sub-topic ╲ model | deepseek-r1-14b | qwen3-8b | gptoss-20b | llama31-8b | phi4-14b |
|---|---|---|---|---|---|
| falun_gong | 50% | 0% | 50% | 0% | 0% |
| hong_kong | 50% | 0% | 0% | 0% | 0% |
| taiwan | 50% | 0% | 50% | 0% | 0% |
| tiananmen | 100% | 75% | 25% | 0% | 0% |
| tibet | 50% | 0% | 0% | 0% | 0% |
| xi_jinping | 50% | 0% | 100% | 50% | 50% |
| xinjiang | 50% | 0% | 50% | 50% | 0% |
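As a rough consistency check on the table above, the sub-topic rates can be pooled into one rate per model. The sketch assumes each model gave 16 CCP-sensitive answers, four on Tiananmen and two on each other sub-topic. Those counts are an inference from the 25% steps in the Tiananmen row and the 50% steps elsewhere; they are not stated in this report.

```python
# Column order of the table above.
MODELS = ["deepseek-r1-14b", "qwen3-8b", "gptoss-20b", "llama31-8b", "phi4-14b"]

# Sub-topic refusal rates (%) copied from the table above.
SUBTOPIC_RATES = {
    "falun_gong": [50, 0, 50, 0, 0],
    "hong_kong":  [50, 0, 0, 0, 0],
    "taiwan":     [50, 0, 50, 0, 0],
    "tiananmen":  [100, 75, 25, 0, 0],
    "tibet":      [50, 0, 0, 0, 0],
    "xi_jinping": [50, 0, 100, 50, 50],
    "xinjiang":   [50, 0, 50, 50, 0],
}

# ASSUMPTION: answers per sub-topic (both languages combined), inferred
# from the granularity of the percentages, not taken from the source.
ASSUMED_ANSWERS = {topic: 2 for topic in SUBTOPIC_RATES}
ASSUMED_ANSWERS["tiananmen"] = 4


def pooled_refusal_rate(model):
    """Overall CCP-sensitive refusal rate (%) implied by the sub-topic rates."""
    col = MODELS.index(model)
    refused = sum(
        SUBTOPIC_RATES[t][col] * ASSUMED_ANSWERS[t] / 100 for t in SUBTOPIC_RATES
    )
    return 100 * refused / sum(ASSUMED_ANSWERS.values())


for m in MODELS:
    print(m, pooled_refusal_rate(m))
```

Under these assumed counts the pooled rates come out at 62.5%, 18.75%, 37.5%, 12.5%, and 6.25%, which agree with the 62%, 19%, 38%, 12%, and 6% in the CCP refusal column of the roster table to within rounding.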
5 Judge-assessed answer quality
Beyond refuse/answer, the LLM judge rates each substantive answer for hedging (0–4), sycophancy, and factual accuracy against a gold reference. These separate evasive engagement from outright refusal.
6 Model roster & summary
| Model | Origin | Vendor | CCP refusal | Culture-war refusal | Control refusal | zh−en Δ (CCP, percentage points) |
|---|---|---|---|---|---|---|
| deepseek-r1-14b | China | DeepSeek | 62% | 14% | 8% | +75 pp |
| qwen3-8b | China | Alibaba | 19% | 9% | 0% | +12 pp |
| gptoss-20b | United States | OpenAI | 38% | 27% | 0% | +50 pp |
| llama31-8b | United States | Meta | 12% | 41% | 0% | +25 pp |
| phi4-14b | United States | Microsoft | 6% | 9% | 0% | +12 pp |
See Models for full cards and Results explorer to read any individual transcript. Methodology and scoring rubric are on the Methodology page.
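Origin-level figures can be approximated from the roster table by averaging the per-model rows. The sketch below takes unweighted means of the rounded percentages, which is an assumption about the aggregation; figures computed from raw answer counts may differ by about a point.

```python
from statistics import mean

# (model, origin, CCP refusal %, culture-war refusal %, control refusal %, zh-en delta in pp)
# Values copied from the roster table above.
ROSTER = [
    ("deepseek-r1-14b", "China", 62, 14, 8, 75),
    ("qwen3-8b", "China", 19, 9, 0, 12),
    ("gptoss-20b", "United States", 38, 27, 0, 50),
    ("llama31-8b", "United States", 12, 41, 0, 25),
    ("phi4-14b", "United States", 6, 9, 0, 12),
]
COLUMNS = {"ccp": 2, "culture_war": 3, "control": 4, "zh_en_delta": 5}


def origin_mean(origin, column):
    """Unweighted mean of one roster column over all models of the given origin."""
    idx = COLUMNS[column]
    return mean(row[idx] for row in ROSTER if row[1] == origin)


for column in COLUMNS:
    china = origin_mean("China", column)
    us = origin_mean("United States", column)
    print(f"{column}: China {china:.1f}, US {us:.1f}, difference {china - us:+.1f}")
```

For the CCP column this gives 40.5% for the Chinese-origin models and about 18.7% for the US models, close to the 41% and 19% quoted in the conclusions.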
7 Conclusions
- On CCP-sensitive prompts, Chinese-origin models refuse or deflect at 41% versus 19% for US models, a gap of 22 percentage points that is consistent with topic-selective censorship.
- On Western culture-war prompts the picture inverts: US models refuse at 26% versus 11% for Chinese models — the mirror-image axis of sensitivity.
- Chinese models show a mean refusal delta of +44 percentage points when CCP-sensitive prompts are posed in Mandarin rather than English, a language-conditioned censorship fingerprint.
- Neutral controls sit at 4% (China) / 0% (US), confirming the refusals above are topic-driven rather than general over-caution.
- These are small open-weight models (8–20B parameters) running locally; results characterise these checkpoints, not the vendors' flagship hosted systems. The harness is fully reproducible (see Methodology).