Investigating Non-Transitivity in LLM-as-a-Judge

Yi Xu, Laura Ruis, Tim Rocktäschel, Robert Kirk

TL;DR

This work shows that LLM judges in baseline-fixed evaluation frameworks exhibit both hard and soft non-transitive preferences, especially when the compared models perform similarly. To address this, the authors introduce a baseline-free round-robin tournament framework scored with the Bradley-Terry model, along with Swiss-Wise Iterative Matchmaking (Swim) tournaments that retain the benefits of round-robin play at lower computational cost. The resulting rankings align more closely with human rankings from Chatbot Arena than those of standard AlpacaEval. The authors quantify non-transitivity with the PNT and SNTD metrics, identify position bias as a key contributing factor, and show that debiasing strategies and structured prompting mitigate some of its effects. The results support tournament-based evaluation for more reliable model ranking on open-ended instruction-following tasks; limitations and future work are outlined.

Abstract

Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0% -> 96.4% and 82.1% -> 86.3% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
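To make the scoring step concrete, the following is a minimal sketch of fitting Bradley-Terry strengths to round-robin results. It is not the authors' implementation: the function name `fit_bradley_terry`, the win counts, and the choice of the standard minorization-maximization (MM) update are all assumptions made for illustration. The example data deliberately contain a non-transitive cycle (A beats B, B beats C, C beats A in majority terms) to show that the fit still yields a single ranking.

```python
def fit_bradley_terry(wins, n_iter=500, tol=1e-12):
    """Estimate Bradley-Terry strengths with the classic MM update.

    wins[i][j] is the number of comparisons in which model i was
    preferred over model j (hypothetical data layout).
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        new_p = [x / norm for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical judge outcomes for three models (10 comparisons per pair).
# Majority preferences form a cycle: A > B (7-3), B > C (6-4), C > A (6-4).
models = ["A", "B", "C"]
wins = [
    [0, 7, 4],
    [3, 0, 6],
    [6, 4, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(models, key=lambda m: strengths[models.index(m)], reverse=True)
print(ranking)
```

Because every model is compared against every other model, no single baseline determines the outcome; the fitted strengths aggregate evidence from all pairings, including those that disagree with one another.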

Paper Structure

This paper contains 34 sections, 13 equations, 11 figures, 10 tables, 1 algorithm.

Figures (11)

  • Figure 1: Rankings from baseline-fixed frameworks show high sensitivity to the choice of baseline. Each entry $(x, y)$ represents the win rate of model $m_x$ against $m_y$, where each column reflects a ranking with the column model as the baseline. Inconsistency emerges when Llama-3-70B and Claude-3-Opus are used as baselines. \ref{detailed_matrix} provides the detailed matrix comparing 20 models.
  • Figure 2: (Left) Inconsistent rankings are observed in baseline-fixed frameworks based on pairwise comparisons due to non-transitivity in the judge's evaluations. Different choices of baselines can lead to varying rankings, undermining the reliability and robustness of this approach. (Right) We propose a round-robin tournament framework where all models are compared pairwise. The results are used to capture non-transitivity in the judge's evaluations and score models using the Bradley-Terry model. This method produces rankings that are more robust and better aligned with human evaluation.
  • Figure 3: Larger performance gaps lead to more consistent preferences. We quantify the proportion of consistent preferences of GPT-4-Turbo and GPT-3.5-Turbo across four scenarios differentiated by relative model performance, where $\gg$ denotes substantial performance advantages and $\approx$ indicates marginal differences.
  • Figure 4: Non-transitivity becomes more pronounced as the model performance gap approaches the origin. We find that both PNT and SNTD peak near the origin when GPT-4-Turbo serves as the judge.
  • Figure 5: Proportion of (non-)transitive instructions across all scenarios, as evaluated by GPT-4-Turbo and GPT-3.5-Turbo. When evaluating model triplets with GPT-3.5-Turbo as judge, over 96% of instructions exhibit position bias effects. In contrast, GPT-4-Turbo demonstrates substantially higher evaluation consistency. Our analysis reveals that position switching provides more effective bias mitigation than random assignment for less position-biased judges.
  • ...and 6 more figures
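
The hard non-transitivity referred to in the captions above can be illustrated with a small sketch that scans a pairwise win-rate matrix for cyclic triplets, that is, triples where the majority preferences form a loop. This is only an illustration under stated assumptions: the function name `cyclic_triplets` and the win rates are hypothetical, and the resulting fraction is a simple cycle count, not necessarily the paper's PNT or SNTD definition.

```python
from itertools import combinations


def cyclic_triplets(win_rate):
    """Return all model triplets whose majority preferences form a cycle.

    win_rate[x][y] is the fraction of instructions on which the judge
    prefers model x over model y (hypothetical data layout). Exact ties
    at 0.5 are treated as no preference and never form a cycle.
    """
    def beats(x, y):
        return win_rate[x][y] > 0.5

    cycles = []
    for a, b, c in combinations(range(len(win_rate)), 3):
        forward = beats(a, b) and beats(b, c) and beats(c, a)
        backward = beats(b, a) and beats(c, b) and beats(a, c)
        if forward or backward:
            cycles.append((a, b, c))
    return cycles


# Hypothetical win rates for four models; row x, column y = P(x preferred over y).
rates = [
    [0.50, 0.60, 0.40, 0.70],
    [0.40, 0.50, 0.55, 0.80],
    [0.60, 0.45, 0.50, 0.90],
    [0.30, 0.20, 0.10, 0.50],
]
cycles = cyclic_triplets(rates)
n_triplets = len(list(combinations(range(len(rates)), 3)))
cycle_fraction = len(cycles) / n_triplets
print(cycles, cycle_fraction)
```

In this made-up matrix the three closely matched models (indices 0, 1, 2) form the only cycle, while the clearly weaker model (index 3) never participates in one, mirroring the captions' observation that non-transitivity concentrates where performance gaps are small.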