Investigating Non-Transitivity in LLM-as-a-Judge
Yi Xu, Laura Ruis, Tim Rocktäschel, Robert Kirk
TL;DR
This work shows that LLM judges in baseline-fixed evaluation frameworks exhibit both hard and soft non-transitive preferences, especially when the compared models perform similarly. To address this, the authors propose a baseline-free round-robin tournament scored with a Bradley-Terry model, together with Swiss-Wise Iterative Matchmaking (Swim) tournaments that aim to retain the robustness of round-robin at lower computational cost. The resulting rankings align more closely with human rankings from Chatbot Arena than those of standard AlpacaEval. The authors quantify non-transitivity with the PNT and SNTD metrics, identify position bias as a key contributing factor, and show that debiasing strategies and structured prompting mitigate some of the effect. The results support adopting tournament-based evaluation for more reliable model ranking on open-ended instruction-following tasks; the authors also outline limitations and future work on broader applicability.
Abstract
Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0% -> 96.4% and 82.1% -> 86.3% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
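To make the round-robin plus Bradley-Terry idea concrete, the sketch below fits Bradley-Terry strengths to a small pairwise win matrix using the standard minorization-maximization update. It is an illustration under stated assumptions, not the authors' implementation: the win counts are invented toy data, the names `fit_bradley_terry` and `majority_cycle` are hypothetical, and the cycle check operates on aggregate majority preferences rather than on the paper's own non-transitivity metrics.

```python
def fit_bradley_terry(wins, iters=200, tol=1e-10):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] is the number of times the judge preferred model i over model j.
    Assumes every model wins at least one comparison, so no strength hits zero.
    """
    n = len(wins)
    p = [1.0 / n] * n  # start from equal strengths
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each opponent contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]  # normalize so strengths sum to 1
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


def majority_cycle(wins, a, b, c):
    """Return True if majority preferences among a, b, c form a cycle."""
    def beats(x, y):
        return wins[x][y] > wins[y][x]

    return (beats(a, b) and beats(b, c) and beats(c, a)) or (
        beats(b, a) and beats(c, b) and beats(a, c)
    )


# Toy data (invented): 100 comparisons per pair among models 0, 1, 2.
# Model 0 beats 1 (60-40), 1 beats 2 (60-40), yet 2 beats 0 (55-45).
wins = [
    [0, 60, 45],
    [40, 0, 60],
    [55, 40, 0],
]

strengths = fit_bradley_terry(wins)
# Model indices ordered from strongest to weakest estimated strength.
ranking = sorted(range(len(wins)), key=lambda i: -strengths[i])
has_cycle = majority_cycle(wins, 0, 1, 2)
print(ranking, has_cycle)
```

In this toy example the majority preferences are cyclic, so a ranking anchored on any single baseline depends on which model is chosen as the baseline, whereas the Bradley-Terry fit uses all pairs at once and still yields a single ordering.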
