Results explorer

All 2288 inferences are listed here, each with its prompt, answer, reasoning trace, refusal flag, and judge verdict. Filter and read the raw transcripts; nothing is hidden.

Filters: Model, Origin, Category, Language, Status, Search