A reproducible comparison of political bias & refusal in US and Chinese language models

Methodology

1  Design

The experiment is a 2×2 of model origin (China vs US/West) against prompt sensitivity (sensitive-to-Beijing vs sensitive-in-the-West), plus universal-rights probes and neutral controls. Each prompt is asked in both English and Mandarin so that language-conditioned behaviour can be isolated. The measured quantity is not raw capability but selective willingness to engage.

2  Prompt categories

CategoryPromptsPurpose
CCP-sensitive32Topics censored or officially framed in mainland China (Tiananmen, Xinjiang, Taiwan, Party leadership). Expect a positive zh−en refusal delta from Chinese models.
Western culture-war44Politically charged topics in the US/West where Western models may hedge or refuse. The mirror axis.
Universal36Universal human-rights and civil-liberties questions that no model should need to refuse.
Neutral control24Neutral competence checks. Refusal here ≈ 0 for a well-behaved model; high values mean general over-refusal, not censorship.

The full bank is browsable on the Question bank page, including each prompt's two scoring poles and a gold-reference answer.

3  Inference

  • Models run locally on Apple-Silicon via Ollama (native Metal GPU); no answers leave the machine.
  • A single LiteLLM gateway presents every model behind one OpenAI-compatible API.
  • Decoding is greedy (temperature = 0), 1 sample per prompt, up to 3072 tokens — a high cap so that long chain-of-thought traces in reasoning models don't starve the final answer and get miscounted as a refusal.
  • Languages: en, zh. Every model sees an identical prompt set.

4  Scoring

Each answer passes through two stages:

  1. Stage 1 — refusal classifier. A rule-based pass plus a "no reasoning trace ⇒ canned refusal" heuristic for reasoning models flags hard refusals and deflections. This drives the Figure 1–3 refusal rates.
  2. Stage 2 — LLM judge. An independent judge model scores each substantive answer on an ordinal rubric: refusal (0/1/2), hedging (0–4), factual accuracy vs the gold reference, sycophancy, and boolean flags for censorship mentions, parroting an official line, nationalist tone, safety boilerplate, and self-censorship visible in the reasoning trace.

Judges are deliberately kept out of the model-under-test set to avoid a model grading itself:

  • United States Googlegemma3:12b

5  Reproducibility & limitations

  • Every run is pinned to a git SHA and a question-set hash; raw transcripts are published in full.
  • Models are small open-weight checkpoints (7–20B). Findings describe these checkpoints, not vendors' flagship hosted models.
  • Labels are machine-generated and imperfect; the Results explorer exposes every prompt, answer, reasoning trace, and judge verdict so readers can audit the labels themselves.