Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data

Florian E. Dorner, Vivian Y. Nastl, Moritz Hardt

TL;DR

This paper analyzes the limits of using large language model (LLM) judges for scalable evaluation, particularly at the frontier where evaluated models may outperform the judge. It formalizes a binary evaluation framework and demonstrates that judge bias can significantly distort model rankings. The authors use Prediction-Powered Inference (PPI) to debias proxy judgments and prove that, in the frontier setting, the best possible data-efficiency gain is at most a factor of two. Empirical results on MMLU, MT-Bench, and TruthfulQA show gains that are more modest still. The work further extends to non-binary proxies and shows that while soft judgments can improve efficiency, they do not overturn the fundamental frontier bound, which limits the practical gains of LLM-as-a-judge when the judge is no more accurate than the evaluated model.

Abstract

High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods, in principle, can go. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half. Our result speaks to the severe limitations of the LLM-as-a-judge paradigm at the evaluation frontier where the goal is to assess newly released models that are possibly better than the judge. Through an empirical evaluation, we demonstrate that the sample size savings achievable in practice are even more modest than what our theoretical limit suggests. Along the way, our work provides new observations about debiasing methods for model evaluation, and points out promising avenues for future work.
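The debiasing idea described in the abstract (a few high-quality labels correcting many judge scores) can be illustrated with a minimal sketch of the standard prediction-powered mean estimator. This is not the authors' code: the function names `ppi_mean` and `simulate`, the weight `lam`, and all simulation parameters (`acc`, `judge_agree`, `n`, `N`) are hypothetical choices made for illustration, and the simulated judge is an assumption, not a model of any real LLM judge.

```python
import random
from statistics import fmean


def ppi_mean(labels, judge_labeled, judge_unlabeled, lam=1.0):
    """Prediction-powered estimate of the mean ground-truth score.

    labels          : ground-truth scores (0/1) on the small labeled set
    judge_labeled   : judge scores on the same labeled examples
    judge_unlabeled : judge scores on the large unlabeled set
    lam             : weight on the judge (hypothetical name); lam=0 ignores it
    """
    if len(labels) != len(judge_labeled):
        raise ValueError("labels and judge_labeled must have equal length")
    # The correction has zero mean when labeled and unlabeled examples come
    # from the same distribution, so the estimate stays unbiased for any lam.
    correction = fmean(judge_unlabeled) - fmean(judge_labeled)
    return fmean(labels) + lam * correction


def simulate(n=200, N=20000, acc=0.7, judge_agree=0.8, seed=0):
    """Draw synthetic data: a model with accuracy `acc` and a judge that
    agrees with the ground truth with probability `judge_agree` (assumed)."""
    rng = random.Random(seed)

    def draw():
        y = 1 if rng.random() < acc else 0
        j = y if rng.random() < judge_agree else 1 - y
        return y, j

    labeled = [draw() for _ in range(n)]
    judge_unlabeled = [draw()[1] for _ in range(N)]
    labels = [y for y, _ in labeled]
    judge_labeled = [j for _, j in labeled]
    return labels, judge_labeled, judge_unlabeled


if __name__ == "__main__":
    labels, judge_labeled, judge_unlabeled = simulate()
    print("judge-only estimate:", fmean(judge_unlabeled))
    print("labels-only estimate:", fmean(labels))
    print("PPI estimate:", ppi_mean(labels, judge_labeled, judge_unlabeled))
```

In this simulation the judge-only average is biased (its expectation is 0.7 · 0.8 + 0.3 · 0.2 = 0.62 rather than 0.7), while the prediction-powered estimate is centered on the true accuracy. How much variance it saves relative to using the labels alone depends on how well the judge tracks the ground truth, which is the quantity the paper bounds.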

Paper Structure

This paper contains 42 sections, 17 theorems, 83 equations, 8 figures.

Key Result

Proposition 1

Consider a binary classifier $\tilde{m}$ and a set of strictly better binary classifiers $\mathcal{M}$ such that $\tilde{m}(x) = y(x)$ implies $m_i(x) = y(x)$ for all $m_i \in \mathcal{M}$. Let $\mathbb{E}\, s(m_i)$ represent the accuracy of model $m_i$ evaluated on the correct la

Figures (8)

  • Figure 1: Model ranks on MMLU based on true labels compared to LLM labels in a semi-synthetic setting (a) and for the top-10 models on HELM by July 2024 (b). Using LLM labels heavily perturbs the ranking despite high judge accuracy.
  • Figure 2: Best possible sample efficiency factor $\tau_{\max}$ according to Theorem \ref{thm:Cramer-Rao}, using different judges (colors) evaluating different models (x-ticks) on MMLU. Error bars show $90\%$ confidence intervals. Sample efficiency gains stay below two, unless SOTA models are used to evaluate weak models.
  • Figure 3: Best possible sample efficiency factor $\tau_{\max}$ according to Theorem \ref{thm:Cramer-Rao}, using GPT-4-as-a-Judge, evaluating different models on MT-Bench. Error bars show $90\%$ confidence intervals. Sample efficiency gains get close to two in some cases, but consistently stay below that value.
  • Figure 4: Sample efficiency factor $\tau(\hat{\theta}_{\lambda^*}^{PP})$ for PPI in the $N\to \infty$ limit, using Llama3.1-405B-as-Judge on MMLU, with binary and non-binary scores. Error bars show $90\%$ confidence intervals. Non-binary scores improve the sample efficiency factor $\tau(\hat{\theta}_{\lambda^*}^{PP})$, but it stays below two at the frontier.
  • Figure 5: Best possible sample efficiency factor $\tau_{\max}$ and lower/upper bounds based on the balanced accuracy $\mathrm{BA}$ for different judges (colors) and evaluated models (x-ticks).
  • ...and 3 more figures

Theorems & Definitions (29)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • proof
  • Proposition 4
  • Theorem 5
  • Theorem 6
  • Corollary 7
  • Proposition 8
  • Proposition 9
  • ...and 19 more