Race to the Bottom

Human consensus vs. averaged model preferences. Each model was run over all 1225 pairwise comparisons three times; the model ranks below are derived from the aggregated pairwise “worse” votes. Rank 1 is the worst.
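As a rough illustration of how pairwise “worse” votes could be turned into a ranking, here is a minimal sketch. It is not the pipeline used for these results. It assumes 50 items (since 50 × 49 / 2 = 1225 unordered pairs), that each comparison yields exactly one “worse” vote, that votes from the three passes are simply pooled, and that ties are broken by item id. The names `rank_by_worse_votes`, `toy_judge`, and the `itemNN` ids are hypothetical.

```python
from collections import Counter
from itertools import combinations


def rank_by_worse_votes(items, worse_votes):
    """Rank items so that rank 1 has the most 'worse' votes.

    worse_votes holds one item id per judged comparison: the item voted worse.
    Ties are broken by item id (an assumption, chosen only for determinism).
    """
    counts = Counter({item: 0 for item in items})  # keep items with zero votes
    counts.update(worse_votes)
    ordered = sorted(items, key=lambda item: (-counts[item], item))
    return {item: rank for rank, item in enumerate(ordered, start=1)}


# Hypothetical setup: 50 items give 50 * 49 / 2 = 1225 unordered pairs.
items = [f"item{i:02d}" for i in range(50)]
pairs = list(combinations(items, 2))


def toy_judge(a, b):
    """Made-up stand-in for a model: votes the smaller id as worse."""
    return min(a, b)


# Three passes over every pair, with votes pooled across passes.
votes = [toy_judge(a, b) for _ in range(3) for a, b in pairs]
ranks = rank_by_worse_votes(items, votes)

print(len(pairs), ranks["item00"], ranks["item49"])  # prints: 1225 1 50
```

Pooling raw votes is only one possible aggregation; averaging per-pass ranks or fitting a Bradley-Terry style model would be alternatives and could order close items differently.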

Columns: Human | GPT-5.5 medium avg | DeepSeek V4 Flash avg | Kimi K2.6 avg