Political Censorship and Refusal Asymmetry in US and Chinese Language Models

An automated, reproducible audit · Run #2 · 2026-05-31

Abstract

We probe 11 chat models, 5 of Chinese origin and 6 of US/Western origin (8 open-weight, run locally; 3 commercial, via a hosted API), with a parallel bank of 104 prompts spanning four categories: topics that are politically sensitive for the Chinese state, Western culture-war flashpoints, universal-rights questions, and neutral controls. Every prompt is posed in both English and Mandarin at temperature 0, giving 11 × 104 × 2 = 2288 answers. Each answer is scored for refusal, deflection, hedging, and adherence to an official line by two independent LLM judges, one US (Google gemma3:12b) and one Chinese (Alibaba qwen3:14b), plus a rule-based classifier. Using judges of both origins guards against a Western judge biasing the verdict; the two agree on whether a model refused at Cohen's κ = 0.877. The central question: does each model's willingness to engage depend on whose politics the question touches? All prompts, raw transcripts, and code are published for replication.
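The inter-judge agreement statistic can be recomputed from the two judges' per-answer refusal labels. The following is a minimal sketch of Cohen's κ for two raters, assuming each judge's verdicts are available as equal-length lists of labels; the function name and the toy labels are illustrative and are not taken from the published harness.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        freq_a[label] * freq_b[label]
        for label in freq_a.keys() | freq_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts (True = "refused") for ten answers, not real audit data.
judge_us = [True, True, False, False, False, False, True, False, False, False]
judge_cn = [True, False, False, False, False, False, True, False, False, False]
print(round(cohens_kappa(judge_us, judge_cn), 3))
```

Because κ discounts agreement expected by chance, it is a stricter measure than raw percent agreement when refusals are rare, as they are for most models here.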

1 Headline result

The diagnostic is not "does a model refuse" but which topics it refuses. A model that censors politically sensitive content selectively will show a high refusal rate in one column and a low one in another. The four numbers below summarise that diagonal.

14%

Chinese models' refusal / deflection on CCP-sensitive prompts

6%

US models' refusal on the same CCP-sensitive prompts

16%

US models' refusal on Western culture-war prompts

7%

Chinese models' refusal on the same culture-war prompts

The asymmetry of interest: each origin's refusal rate is expected to spike on the prompts that are politically sensitive in its own jurisdiction. Controls (neutral questions) should sit near zero for everyone — a model that refuses those is merely over-cautious, not censoring.

Figure 0. The whole thesis in one chart. Each model is placed by its refusal rate on CCP-sensitive prompts (x) versus the combined Western-sensitive axis (y), coloured by origin. A perfectly even-handed model sits near the diagonal; Chinese models are expected toward the lower-right (censor China topics), Western models toward the upper-left (refuse Western-charged topics).
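The placement described in the Figure 0 caption can be sketched as a signed offset from the x = y diagonal. This is an illustration only: it uses the Western culture-war rate from the Section 2 table as a stand-in for the combined Western-sensitive axis (whose exact composition is not stated here), and the 5-point tolerance and the function names are assumptions.

```python
def diagonal_offset(ccp_rate, western_rate):
    """Signed offset from the x = y diagonal, in percentage points."""
    return ccp_rate - western_rate


def quadrant(ccp_rate, western_rate, tolerance=5.0):
    """Label a model's position; `tolerance` is an arbitrary illustrative band."""
    offset = diagonal_offset(ccp_rate, western_rate)
    if offset > tolerance:
        return "lower-right"   # refuses CCP-sensitive prompts more often
    if offset < -tolerance:
        return "upper-left"    # refuses Western-charged prompts more often
    return "near diagonal"


# (CCP-sensitive %, Western culture-war %) for three models, from Section 2.
rates = {
    "deepseek-r1-14b": (55, 14),
    "llama31-8b": (10, 41),
    "claude-sonnet-4-6": (0, 0),
}
for model, (x, y) in rates.items():
    print(model, diagonal_offset(x, y), quadrant(x, y))
```

Note that a model sitting at the origin is "near diagonal" because it refuses nothing, which is a different kind of even-handedness from a model that refuses both axes equally often.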

2 Refusal by model and category

Figure 1. Stage-1 refusal/deflection rate per model, on the core 2×2 categories plus controls. Bars are coloured by model origin. Read the gap between a model's CCP-sensitive bar and its control bar.

category ╲ model	deepseek-r1-14b	glm4-9b	qwen3-30b-a3b	qwen3-8b	yi-9b	claude-sonnet-4-6	gptoss-20b	grok-4.3	grok-4.3-reasoning	llama31-8b	phi4-14b
CCP-sensitive	55%	5%	5%	7%	0%	0%	13%	0%	7%	10%	5%
Western culture-war	14%	5%	5%	9%	5%	0%	19%	9%	18%	41%	9%
Protected-identity & offense	11%	22%	17%	22%	17%	0%	28%	11%	11%	22%	33%
Gender & biological sex	11%	6%	0%	0%	0%	0%	0%	0%	28%	22%	22%
Race, crime & group data	0%	8%	0%	0%	0%	0%	4%	0%	12%	8%	8%
Progressive orthodoxy	17%	6%	0%	0%	0%	0%	0%	0%	17%	22%	0%
Universal	28%	6%	0%	0%	0%	0%	6%	0%	6%	22%	6%
Neutral control	8%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%
western_pc_symmetry	19%	0%	0%	0%	3%	0%	3%	0%	28%	17%	3%

Figure 1b. Refusal/deflection rate for every prompt category, including the expanded Western axis (identity, gender, race-data, ideology, symmetry). Darker = more refusal.

3 Cross-lingual censorship gap

If a model is more reluctant to discuss a sensitive topic in Mandarin than in English, that language-conditioned gap is a strong fingerprint of trained-in censorship. Figure 2 plots, per model, the refusal rate on CCP-sensitive prompts in Chinese minus the rate in English.

Figure 2. Cross-lingual refusal delta on CCP-sensitive prompts (zh − en). Positive (rightward) bars indicate a model that censors more when asked in Chinese.
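The delta plotted in Figure 2 is a difference of two refusal rates, so it is best read in percentage points. A minimal sketch of the calculation, assuming per-answer records of the form (model, language, refused); the record layout, the function name, and the toy data are hypothetical rather than the harness's actual schema.

```python
from collections import defaultdict


def crosslingual_delta(records):
    """Per-model refusal rate in Chinese minus the rate in English (points)."""
    # model -> language -> [refusals, answers]
    counts = defaultdict(lambda: {"zh": [0, 0], "en": [0, 0]})
    for model, lang, refused in records:
        if lang not in ("zh", "en"):
            raise ValueError(f"unexpected language code: {lang!r}")
        cell = counts[model][lang]
        cell[0] += int(refused)
        cell[1] += 1
    deltas = {}
    for model, by_lang in counts.items():
        (zh_ref, zh_n), (en_ref, en_n) = by_lang["zh"], by_lang["en"]
        if zh_n == 0 or en_n == 0:
            continue  # a delta needs answers in both languages
        deltas[model] = 100.0 * (zh_ref / zh_n - en_ref / en_n)
    return deltas


# Hypothetical records, not real audit data.
toy = (
    [("model-a", "zh", True)] * 3 + [("model-a", "zh", False)]
    + [("model-a", "en", True)] + [("model-a", "en", False)] * 3
    + [("model-b", "zh", False)] * 2 + [("model-b", "en", False)] * 2
)
print(crosslingual_delta(toy))
```

A positive value means the model refused more often when the same prompts were posed in Chinese.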

4 Where the refusals cluster

Refusal rate on CCP-sensitive prompts, broken down by sub-topic (rows) and model (columns). Darker = more refusal. This localises censorship to specific subjects — typically Tiananmen, Xinjiang, Taiwan, and Party leadership.

sub-topic ╲ model	deepseek-r1-14b	glm4-9b	qwen3-30b-a3b	qwen3-8b	yi-9b	claude-sonnet-4-6	gptoss-20b	grok-4.3	grok-4.3-reasoning	llama31-8b	phi4-14b
censorship	83%	17%	0%	0%	0%	0%	0%	0%	0%	17%	0%
covid	50%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%
dissidents	100%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%
falun_gong	67%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%
governance	50%	0%	0%	0%	0%	0%	50%	0%	0%	0%	0%
hong_kong	25%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%
south_china_sea	50%	0%	0%	0%	0%	0%	0%	0%	50%	0%	50%
surveillance	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%
taiwan	50%	0%	0%	0%	0%	0%	50%	0%	0%	0%	0%
tiananmen	100%	25%	50%	75%	0%	0%	25%	0%	0%	0%	0%
tibet	25%	0%	0%	0%	0%	0%	0%	0%	0%	0%	0%
xi_jinping	25%	0%	0%	0%	0%	0%	25%	0%	25%	50%	25%
xinjiang	50%	0%	0%	0%	0%	0%	50%	0%	50%	50%	0%

Figure 3. Sub-topic × model refusal heatmap (CCP-sensitive prompts, both languages pooled).
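The heatmap above is a pivot of per-answer refusal flags by sub-topic and model. A minimal sketch of that aggregation, assuming records of the form (sub_topic, model, refused) with both languages already pooled; the record layout, function names, and toy data are hypothetical.

```python
from collections import defaultdict


def refusal_matrix(records):
    """Map (sub_topic, model) -> refusal rate in percent."""
    tally = defaultdict(lambda: [0, 0])  # [refusals, answers] per cell
    for sub_topic, model, refused in records:
        cell = tally[(sub_topic, model)]
        cell[0] += int(refused)
        cell[1] += 1
    return {key: 100.0 * r / n for key, (r, n) in tally.items()}


def render(matrix):
    """Render the matrix as a tab-separated table, sub-topics as rows."""
    topics = sorted({topic for topic, _ in matrix})
    models = sorted({model for _, model in matrix})
    lines = ["\t".join(["sub-topic ╲ model", *models])]
    for topic in topics:
        cells = [
            f"{matrix[(topic, m)]:.0f}%" if (topic, m) in matrix else ""
            for m in models
        ]
        lines.append("\t".join([topic, *cells]))
    return "\n".join(lines)


# Hypothetical records, not real audit data.
toy = (
    [("tiananmen", "model-a", True)] * 4
    + [("tiananmen", "model-b", True)] + [("tiananmen", "model-b", False)] * 3
    + [("tibet", "model-a", True)] + [("tibet", "model-a", False)] * 3
    + [("tibet", "model-b", False)] * 4
)
print(render(refusal_matrix(toy)))
```

With only a handful of answers per cell, each refusal moves a cell by a large step (25 points when a cell holds four answers), which is worth keeping in mind when comparing individual cells.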

5 Judge-assessed answer quality

Beyond refuse/answer, the LLM judge rates each substantive answer for hedging (0–4), sycophancy, and factual accuracy against a gold reference. These separate evasive engagement from outright refusal.

Figure 4. Mean judge scores per model. Higher hedging / sycophancy and lower factual accuracy indicate softer forms of avoidance.

6 Model roster & summary

Model	Origin	Vendor	CCP refusal	Culture-war refusal	Control refusal	zh−en Δ (CCP)
deepseek-r1-14b	China	DeepSeek	55%	14%	8%	+52%
glm4-9b	China	Zhipu	5%	4%	0%	+10%
qwen3-30b-a3b	China	Alibaba	5%	4%	0%	+10%
qwen3-8b	China	Alibaba	7%	9%	0%	+5%
yi-9b	China	01.AI	0%	4%	0%	+0%
claude-sonnet-4-6	United States	Anthropic	0%	0%	0%	+0%
gptoss-20b	United States	OpenAI	13%	19%	0%	+25%
grok-4.3	United States	xAI	0%	9%	0%	+0%
grok-4.3-reasoning	United States	xAI	7%	18%	0%	+14%
llama31-8b	United States	Meta	10%	41%	0%	+19%
phi4-14b	United States	Microsoft	5%	9%	0%	+0%

See Models for full cards and Results explorer to read any individual transcript. Methodology and scoring rubric are on the Methodology page.
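The cohort-level figures quoted in this report can be checked against the roster. The sketch below assumes each cohort figure is an unweighted mean of the per-model percentages in the roster table; the report does not state its pooling rule, so this is an assumption, and because the inputs are already rounded the result may differ from the report's figures by a point.

```python
from statistics import mean

# origin, CCP %, culture-war %, control %, zh-en delta: transcribed from the roster.
ROSTER = {
    "deepseek-r1-14b": ("China", 55, 14, 8, 52),
    "glm4-9b": ("China", 5, 4, 0, 10),
    "qwen3-30b-a3b": ("China", 5, 4, 0, 10),
    "qwen3-8b": ("China", 7, 9, 0, 5),
    "yi-9b": ("China", 0, 4, 0, 0),
    "claude-sonnet-4-6": ("United States", 0, 0, 0, 0),
    "gptoss-20b": ("United States", 13, 19, 0, 25),
    "grok-4.3": ("United States", 0, 9, 0, 0),
    "grok-4.3-reasoning": ("United States", 7, 18, 0, 14),
    "llama31-8b": ("United States", 10, 41, 0, 19),
    "phi4-14b": ("United States", 5, 9, 0, 0),
}
COLUMNS = ("ccp", "culture_war", "control", "zh_en_delta")


def cohort_means(roster):
    """Unweighted mean of each roster column, grouped by model origin."""
    by_origin = {}
    for origin, *values in roster.values():
        by_origin.setdefault(origin, []).append(values)
    return {
        origin: {col: mean(row[i] for row in rows) for i, col in enumerate(COLUMNS)}
        for origin, rows in by_origin.items()
    }


for origin, cols in cohort_means(ROSTER).items():
    print(origin, {col: round(value, 1) for col, value in cols.items()})
```

An unweighted mean gives every model equal weight, so a single outlier such as deepseek-r1-14b moves its cohort's figure substantially.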

7 Conclusions

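On CCP-sensitive prompts the two cohorts are closer than expected (14% China vs 6% US), and the Chinese mean is driven largely by deepseek-r1-14b (55%); the other four Chinese models refuse at 0–7%. Inspect the per-model and per-subtopic breakdowns before drawing conclusions.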
On Western culture-war prompts the picture inverts: US models refuse at 16% versus 7% for Chinese models — the mirror-image axis of sensitivity.
Chinese models show a mean refusal delta of +15 percentage points when CCP-sensitive prompts are posed in Mandarin rather than English, a language-conditioned censorship fingerprint.
Neutral controls sit at 2% (China) / 0% (US), and the only non-zero control rate is deepseek-r1-14b's 8%. This supports reading the refusals above as topic-driven rather than as general over-caution.
The roster is 8 small open-weight checkpoints (8–30B total parameters, going by the model names) run locally, plus 3 commercial models queried via a hosted API; results characterise these specific models and endpoints, not the vendors' other or flagship systems. The harness is fully reproducible; see Methodology.