Model scorecard
Four scores, each on a 0–100 scale. Higher means more willing to engage or more consistent:
- Openness — how often the model engages rather than refuses, averaged across every political category (neutral controls excluded). The headline ranking.
- Language-even — how little its China-topic refusals change between English and Mandarin.
- Group-even — how alike it treats matched groups (1 − Differential-Treatment Index).
- Stays on-task — how reliably it answers neutral control questions (a sanity check).
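The four definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scorecard's actual implementation. The function names, the input shapes, and the `neutral_control` category label are all hypothetical. The source gives a formula only for Group-even (1 − Differential-Treatment Index); the sketch assumes the index lies in [0, 1]. For Language-even, the sketch assumes one minus the absolute gap between English and Mandarin refusal rates, which may differ from how the score is really computed.

```python
from statistics import mean


def openness(engage_rate_by_category, control_categories=("neutral_control",)):
    """Mean engagement rate over political categories, scaled to 0-100.

    Assumption: rates are fractions in [0, 1], and neutral controls are
    identified by a hypothetical category label.
    """
    political = [
        rate
        for category, rate in engage_rate_by_category.items()
        if category not in control_categories  # neutral controls excluded
    ]
    return 100 * mean(political)


def language_even(refusal_rate_en, refusal_rate_zh):
    """Assumed form: 100 * (1 - |English refusal rate - Mandarin refusal rate|)."""
    return 100 * (1 - abs(refusal_rate_en - refusal_rate_zh))


def group_even(differential_treatment_index):
    """100 * (1 - DTI), assuming the index is a fraction in [0, 1]."""
    return 100 * (1 - differential_treatment_index)


def stays_on_task(controls_answered, controls_total):
    """Share of neutral control questions answered, scaled to 0-100."""
    if controls_total <= 0:
        raise ValueError("controls_total must be positive")
    return 100 * controls_answered / controls_total


if __name__ == "__main__":
    # Made-up inputs for illustration only; not taken from the scorecard.
    rates = {"topic_a": 0.9, "topic_b": 0.8, "neutral_control": 1.0}
    print(round(openness(rates), 1))
    print(round(group_even(0.14), 1))
```

Keeping each score as its own small function makes the difference between them visible: Openness and Stays on-task are plain rates, while the two evenness scores are one minus a gap, so a model can score high on evenness while refusing often in both conditions.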
🥇 qwen3-8b (Alibaba, China): Openness 95%
🥈 phi4-14b (Microsoft, United States): Openness 89%
🥉 deepseek-r1-14b (DeepSeek, China): Openness 81%
#4 llama31-8b (Meta, United States): Openness 79%
#5 gptoss-20b (OpenAI, United States): Openness 78%
- Most even-handed across languages: phi4-14b (100%) — barely changes behaviour between English and Mandarin.
- Most even-handed across groups: gptoss-20b (86%) — treats matched groups most alike.
Read it carefully: a high openness score is not the same as “good” or “more accurate” — it only means the model declines less often. A model can be open and wrong. Pair this with factual-accuracy on the Models page and the raw transcripts in the Results explorer.