Methodology

How the audit is constructed, scored, and made reproducible.

1 Design

The experiment is a 2×2 of model origin (China vs US/West) against prompt sensitivity (sensitive-to-Beijing vs sensitive-in-the-West), plus universal-rights probes and neutral controls. Each prompt is asked in both English and Mandarin so that language-conditioned behaviour can be isolated. The measured quantity is not raw capability but selective willingness to engage.

2 Prompt categories

Category	Prompts	Purpose
CCP-sensitive	462	Topics censored or officially framed in mainland China (Tiananmen, Xinjiang, Taiwan, Party leadership). Expect a positive zh−en refusal delta from Chinese models.
Western culture-war	242	Politically charged topics in the US/West where Western models may hedge or refuse. The mirror axis.
Protected-identity & offense	198	Legal-but-edgy requests touching protected groups, humor, slurs in an academic frame, and group criticism. Measures over-refusal and whether the model treats groups symmetrically.
Gender & biological sex	198	Gender-identity and biological-sex questions with a genuine values or evidence trade-off, where Western models often default to one framing or hedge heavily.
Race, crime & group data	264	Empirically charged questions about crime, policing, and group disparities where a model may refuse to state official statistics rather than report the data and the contested interpretation.
Progressive orthodoxy	198	Contested progressive political claims (systemic racism, reparations, DEI, cancel culture) where models may present one activist framing as settled fact instead of engaging the debate.
Universal	198	Universal human-rights and civil-liberties questions that no model should need to refuse.
Neutral control	132	Neutral competence checks. Refusal here ≈ 0 for a well-behaved model; high values mean general over-refusal, not censorship.

The full bank is browsable on the Question bank page, including each prompt's two scoring poles and a gold-reference answer.

3 Inference

Open-weight models run locally on Apple-Silicon via Ollama (native Metal GPU); their answers never leave the machine. Any commercial models are queried over a vendor-hosted API, so those prompts and answers do transit the provider.
A single LiteLLM gateway presents every model behind one OpenAI-compatible API.
Decoding is greedy (temperature = 0), 1 sample per prompt, up to 8000 tokens — a high cap so that long chain-of-thought traces in reasoning models don't starve the final answer and get miscounted as a refusal.
Languages: en, zh. Every model sees an identical prompt set.

4 Scoring

Each answer passes through two stages:

Stage 1 — refusal classifier. A rule-based pass plus a "no reasoning trace ⇒ canned refusal" heuristic for reasoning models flags hard refusals and deflections. This drives the Figure 1–3 refusal rates.
Stage 2 — LLM judge. An independent judge model scores each substantive answer on an ordinal rubric: refusal (0/1/2), hedging (0–4), factual accuracy vs the gold reference, sycophancy, and boolean flags for censorship mentions, parroting an official line, nationalist tone, safety boilerplate, and self-censorship visible in the reasoning trace.

Judges are deliberately kept out of the model-under-test set to avoid a model grading itself:

United States Google — gemma3:12b
China Alibaba — qwen3:14b

5 Inter-judge agreement

The dual-judge design only carries weight if the two judges — China Alibaba (qwen3:14b) and United States Google (gemma3:12b) — actually agree. We therefore report Cohen's κ per label, computed over the 2288 responses scored by both judges. κ is chance-corrected: 0 is coin-flip agreement, 1 is perfect; ordinal scores use linear weighting so an off-by-one disagreement counts as partial. The two judges agree on whether a model refused with Cohen's κ = 0.877 (almost perfect, 98% raw agreement, n = 2288). Because the single most consequential label — refusal — is reproduced almost identically by an independent Chinese judge, the headline asymmetry cannot be dismissed as an artefact of using a Western model to grade Chinese ones. As expected, objective labels (refusal, nationalist tone) agree far more than inherently subjective ordinal scales (hedging, sycophancy) — which is reported honestly rather than hidden.

Label	κ	Agreement	Raw %	n
refusal (any: refused≥1)	0.877	almost perfect	98%	2288
refusal (3-class: 0/1/2) (weighted)	0.863	almost perfect	97%	2288
hedging (0–4) (weighted)	0.35	fair	26%	2283
factual accuracy (0–3) (weighted)	0.426	moderate	68%	1964
sycophancy (0–3) (weighted)	0.187	slight	96%	2283
stance lean (5-pt) (weighted)	0.468	moderate	50%	2067
engaged substantively	0.898	almost perfect	99%	2288
mentions censorship	0.582	moderate	92%	2283
parrots official line	0.579	moderate	91%	2283
nationalist tone	0.775	substantial	98%	2283
safety boilerplate	0.378	fair	80%	2282
self censorship in trace	0.521	moderate	83%	2283

κ bands follow Landis & Koch (1977): <0.20 slight · 0.21–0.40 fair · 0.41–0.60 moderate · 0.61–0.80 substantial · >0.80 almost perfect. Low κ on a label with high raw agreement (e.g. sycophancy) reflects the base-rate paradox — the label is almost always the same value, so there is little variance for κ to credit.

6 Reproducibility & limitations

Every run is pinned to a git SHA and a question-set hash; raw transcripts are published in full.
Most models are small open-weight checkpoints (7–20B), with one or more commercial API models (closed and hosted, so their served weights can change over time). Findings describe these specific models and endpoints, not vendors' flagship hosted models.
Labels are machine-generated and imperfect; the Results explorer exposes every prompt, answer, reasoning trace, and judge verdict so readers can audit the labels themselves.