Race to the Bottom
Human consensus vs. averaged model preferences. Each model judged all 1,225 pairwise comparisons three times; the model ranks below are derived from the pairwise “worse” votes aggregated across those three runs. Rank 1 is the worst.
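As a rough illustration of how a ranking can be derived from aggregated “worse” votes, here is a minimal Python sketch. It is not the page's actual scoring code: the function name `rank_by_worse_votes`, the `(a, b, worse)` vote format, and the tie-breaking rule (ties keep input order) are all assumptions made for the example. The count of 1,225 comparisons is consistent with every unordered pair drawn from 50 items, since C(50, 2) = 1225, but the pool size of 50 is an inference, not something stated above.

```python
from collections import Counter
from math import comb


def rank_by_worse_votes(items, votes):
    """Rank items from pairwise "worse" votes; rank 1 is the worst.

    items: list of item identifiers (hypothetical format).
    votes: iterable of (a, b, worse) tuples, one per comparison per run,
           where `worse` is whichever of a or b was judged worse.
    """
    counts = Counter({item: 0 for item in items})
    for a, b, worse in votes:
        if worse not in (a, b):
            raise ValueError(f"vote {worse!r} is not one of the pair ({a!r}, {b!r})")
        counts[worse] += 1

    # Most "worse" votes first, so that item gets rank 1.
    # Assumption: ties keep their input order (sorted() is stable).
    ordered = sorted(items, key=lambda item: -counts[item])
    return {item: rank for rank, item in enumerate(ordered, start=1)}


# Assumed pool size: 50 items yield 1225 unordered pairs.
assert comb(50, 2) == 1225

# Toy example: three items, each pair judged in three runs.
toy_items = ["a", "b", "c"]
one_run = [("a", "b", "a"), ("a", "c", "a"), ("b", "c", "b")]
toy_votes = one_run * 3  # pool the votes from all three runs
toy_ranks = rank_by_worse_votes(toy_items, toy_votes)
```

Pooling the votes from all three runs before ranking is one plausible reading of “aggregated”; ranking each run separately and then averaging the ranks would be another, and the two can disagree when runs are inconsistent.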
Human
GPT-5.5 medium avg
DeepSeek V4 Flash avg
Kimi K2.6 avg