A reproducible comparison of political bias & refusal in US and Chinese language models

Model scorecard

Each model gets four scores on a 0–100 scale, shown as percentages. Higher means more willing to engage or more consistent:

  • Openness: how often the model engages rather than refuses, averaged across every political category (neutral controls excluded). This is the headline score that determines the ranking.
  • Language-even: how little its China-topic refusals change between English and Mandarin.
  • Group-even: how similarly it treats matched groups, computed as 1 − Differential-Treatment Index.
  • Stays on-task: how reliably it answers neutral control questions (a sanity check).
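
The four definitions above can be sketched in code. This is a minimal illustration and not the project's actual scoring pipeline. The function names, the `neutral_control` key, and the input shape (refusal rates as fractions between 0 and 1) are all assumptions. The Language-even formula in particular (1 minus the absolute gap between English and Mandarin refusal rates) is a guess at one plausible reading; only "Group-even = 1 − Differential-Treatment Index" is stated in the text.

```python
from statistics import mean


def openness(refusal_by_category: dict[str, float],
             control_key: str = "neutral_control") -> float:
    """Mean engagement rate over political categories, controls excluded.

    `control_key` is a hypothetical label for the neutral-control category.
    """
    engaged = [1.0 - rate
               for category, rate in refusal_by_category.items()
               if category != control_key]
    return 100.0 * mean(engaged)


def language_even(refusal_en: float, refusal_zh: float) -> float:
    """ASSUMED formula: 1 minus the absolute English/Mandarin refusal gap."""
    return 100.0 * (1.0 - abs(refusal_en - refusal_zh))


def group_even(dti: float) -> float:
    """Stated in the text: 1 minus the Differential-Treatment Index."""
    return 100.0 * (1.0 - dti)


def stays_on_task(control_refusal: float) -> float:
    """Share of neutral control questions the model answers."""
    return 100.0 * (1.0 - control_refusal)


# Made-up inputs for illustration only; these are not measured results.
example = {"elections": 0.10, "protest": 0.30, "neutral_control": 0.00}
print(round(openness(example)))        # control category is ignored
print(round(group_even(0.16)))         # a DTI of 0.16 maps to a score of 84
```

Under this reading, a model that refuses the same share of China-topic prompts in both languages scores 100 on Language-even regardless of how high that share is, which is why the score is reported alongside Openness and not instead of it.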

Rank | Model | Developer | Country | Openness | Language-even | Group-even | Stays on-task
🥇 | qwen3-8b | Alibaba | China | 95% | 95% | 84% | 100%
🥈 | phi4-14b | Microsoft | United States | 89% | 100% | 83% | 100%
🥉 | deepseek-r1-14b | DeepSeek | China | 81% | 48% | 81% | 92%
#4 | llama31-8b | Meta | United States | 79% | 81% | 68% | 100%
#5 | gptoss-20b | OpenAI | United States | 78% | 81% | 86% | 100%

  • Most even-handed across languages: phi4-14b (100%) — barely changes behaviour between English and Mandarin.
  • Most even-handed across groups: gptoss-20b (86%) — treats matched groups most alike.
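
The ranking and the two highlights can be re-derived from the published numbers. The sketch below copies the scores from the scorecard; the dictionary layout, the key names, and the `best` helper are illustrative choices, not part of the project's tooling.

```python
# Scores copied from the scorecard (percentages).
scorecard = {
    "qwen3-8b":        {"openness": 95, "language_even": 95,  "group_even": 84, "on_task": 100},
    "phi4-14b":        {"openness": 89, "language_even": 100, "group_even": 83, "on_task": 100},
    "deepseek-r1-14b": {"openness": 81, "language_even": 48,  "group_even": 81, "on_task": 92},
    "llama31-8b":      {"openness": 79, "language_even": 81,  "group_even": 68, "on_task": 100},
    "gptoss-20b":      {"openness": 78, "language_even": 81,  "group_even": 86, "on_task": 100},
}

# The headline ranking orders models by Openness, highest first.
ranking = sorted(scorecard, key=lambda m: scorecard[m]["openness"], reverse=True)


def best(metric: str) -> str:
    """Return the model with the highest value for `metric` (hypothetical helper)."""
    return max(scorecard, key=lambda m: scorecard[m][metric])


print(ranking)                 # qwen3-8b first, gptoss-20b last
print(best("language_even"))   # phi4-14b
print(best("group_even"))      # gptoss-20b
```

Note that the model ranked last on Openness (gptoss-20b) is the one that leads on Group-even, so the headline ranking alone does not capture consistency.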

Read it carefully: a high Openness score is not the same as “good” or “more accurate”; it only means the model declines less often. A model can be open and wrong. Pair this scorecard with the factual-accuracy results on the Models page and the raw transcripts in the Results explorer.