Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

Jaehun Jung, Faeze Brahman, Yejin Choi

TL;DR

The paper tackles the unreliability of single-judge LLM evaluation by introducing a provably reliable framework that guarantees human agreement via selective abstention and cascaded judging. It combines fixed-sequence testing to choose abstention thresholds with Simulated Annotators to produce calibrated confidence, enabling high coverage even with cheaper models. The core contributions are the formal human-agreement guarantee, the Simulated Annotators confidence estimation, and the Cascaded Selective Evaluation protocol, which together deliver strong alignment with human judgments and significant cost reductions across summarization and chat-assistant tasks. The results demonstrate robustness under distribution shift and offer a practical pathway to scalable, reliable evaluation in real-world deployments.

Abstract

We present a principled approach to provide LLM-based evaluation with a rigorous guarantee of human agreement. We first propose that a reliable evaluation method should not uncritically rely on model preferences for pairwise evaluation, but rather assess the confidence of judge models and selectively decide when to trust their judgement. We then show that under this selective evaluation framework, human agreement can be provably guaranteed -- such that the model evaluation aligns with that of humans to a user-specified agreement level. As part of our framework, we also introduce Simulated Annotators, a novel confidence estimation method that significantly improves judge calibration and thus enables high coverage of evaluated instances. Finally, we propose Cascaded Selective Evaluation, where we use cheaper models as initial judges and escalate to stronger models only when necessary -- again, while still providing a provable guarantee of human agreement. Experimental results show that Cascaded Selective Evaluation guarantees strong alignment with humans, far beyond what LLM judges could achieve without selective evaluation. For example, on a subset of Chatbot Arena where GPT-4 almost never achieves 80% human agreement, our method, even while employing substantially more cost-effective models such as Mistral-7B, guarantees over 80% human agreement with almost 80% test coverage.
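The escalation logic described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Stage` container, the judge signature (an instance string in, a preference and a confidence out), and the toy judges with fixed outputs are all assumptions made for the example. Each judge is trusted only if its confidence clears its own threshold; otherwise the instance is passed to the next, stronger judge, and the evaluator abstains if no judge is confident.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical judge interface: instance -> (preferred response, confidence in [0, 1]).
Judge = Callable[[str], Tuple[str, float]]


@dataclass
class Stage:
    name: str
    judge: Judge
    threshold: float  # abstention threshold calibrated for this judge


def cascaded_selective_eval(
    instance: str, stages: Sequence[Stage]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (preference, judge_name), or (None, None) if every judge abstains."""
    for stage in stages:  # stages are ordered from cheapest to strongest
        preference, confidence = stage.judge(instance)
        if confidence >= stage.threshold:
            return preference, stage.name  # confident enough: trust this judge
        # not confident: escalate to the next judge in the cascade
    return None, None  # no judge was confident: abstain


# Toy judges with fixed outputs, purely for illustration.
def weak_judge(instance: str) -> Tuple[str, float]:
    return "A", 0.6


def strong_judge(instance: str) -> Tuple[str, float]:
    return "B", 0.95


stages = [Stage("weak", weak_judge, 0.8), Stage("strong", strong_judge, 0.9)]
result = cascaded_selective_eval("some pairwise comparison", stages)
```

In this toy run the weak judge falls below its threshold, so the instance is escalated and the strong judge's preference is returned. The thresholds here are arbitrary; in the framework they are the quantities that must be calibrated to obtain the guarantee.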

Paper Structure (31 sections, 1 theorem, 17 equations, 8 figures, 9 tables, 2 algorithms)

Key Result

Theorem 1

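Consider a threshold $\widehat{\lambda}$ chosen by the fixed-sequence testing procedure, and a selective evaluator $(f_\textit{LM}, c_\textit{LM})$ operating based on $\widehat{\lambda}$. Then the human-agreement guarantee (agreement with human judgments at the user-specified level on the instances that are evaluated rather than abstained on) is satisfied with probability at least $1 - \delta$.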
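To make the role of $\widehat{\lambda}$ and $\delta$ concrete, here is a hedged sketch of generic fixed-sequence testing over candidate thresholds, using an exact binomial tail as the p-value. It may differ in detail from the paper's procedure. The symbol `alpha` (the tolerated disagreement rate, so the target agreement level is `1 - alpha`), the calibration data, and all function names are assumptions introduced for illustration. Thresholds are tested from most to least conservative, and the search stops at the first threshold whose disagreement rate cannot be certified.

```python
from math import comb
from typing import Optional, Sequence


def binom_cdf(k: int, n: int, p: float) -> float:
    """Exact P(Bin(n, p) <= k)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def select_threshold(
    confidences: Sequence[float],  # judge confidence on each calibration instance
    agrees: Sequence[bool],        # whether the judge matched the human label
    alpha: float,                  # tolerated disagreement rate (assumed symbol)
    delta: float,                  # allowed failure probability of the guarantee
    candidates: Sequence[float],   # candidate thresholds
) -> Optional[float]:
    """Return the lowest certified threshold, or None if none can be certified."""
    chosen = None
    for lam in sorted(candidates, reverse=True):  # most conservative first
        kept = [a for c, a in zip(confidences, agrees) if c >= lam]
        n = len(kept)
        errors = n - sum(kept)
        # p-value for the null "disagreement rate at lam exceeds alpha";
        # with no kept samples this is 1, so the test cannot pass.
        p_value = binom_cdf(errors, n, alpha)
        if p_value > delta:
            break  # fixed-sequence testing stops at the first failure
        chosen = lam
    return chosen


# Toy calibration set: 12 confident and correct, 8 less confident and mostly wrong.
confs = [0.9] * 12 + [0.7] * 8
agree = [True] * 12 + [True] * 3 + [False] * 5
lam_hat = select_threshold(confs, agree, alpha=0.2, delta=0.1, candidates=[0.9, 0.7, 0.5])
```

On this toy data, the threshold 0.9 is certified (12 kept samples, no disagreements), while 0.7 is not (5 disagreements among 20 kept samples), so the search stops there. Lower thresholds increase coverage, which is why better-calibrated confidence translates into more evaluated instances at the same agreement level.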

Figures (8)

  • Figure 1: Illustration of Cascaded Selective Evaluation. We start with a small, cost-effective model as initial judge, estimate its confidence, and escalate to a stronger model only when the previous judge is not confident. By calibrating when to trust which judge model, our method provides a rigorous guarantee of human agreement while employing substantially cheaper judge models.
  • Figure 2: Reliability plot for confidence estimation methods, using GPT-4 as judge on AlpacaEval. Dashed lines denote perfect calibration, and darker bars denote more samples in the corresponding bins. Simulated Annotators reduces expected calibration error by 50% compared to the baselines, mitigating over-confidence observed in predictive probability and verbalized confidence.
  • Figure 3: TL;DR results. Cascaded Selective Evaluation guarantees human agreement far beyond a level achievable by GPT-4 without abstention (Left), while employing substantially weaker judge models (Right). Solid blue line denotes average human agreement over 1000 runs on the dataset, and the light blue region denotes the min / max agreement within the 1000 runs.
  • Figure 4: ChatArena results. Our approach guarantees the target human agreement level (Left) while the majority of evaluations are done with weaker judge models, Mistral-7B and GPT-3.5 (Right).
  • Figure 5: Comparison between abstained vs. evaluated samples. Our abstention policy aligns with how humans agree with each other (IAA), exhibiting no significant reliance on shallow heuristics (length ratio, token overlap).
  • ...and 3 more figures
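
The Figure 2 caption refers to expected calibration error (ECE). For readers unfamiliar with the metric, the following is a minimal sketch of the standard equal-width binned estimator; the bin count and the toy inputs are assumptions for illustration and are not taken from the paper's experimental setup.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],  # judge confidence per instance, in [0, 1]
    correct: Sequence[bool],       # whether the judge agreed with the human label
    n_bins: int = 10,
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, y in b if y) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


# An over-confident toy judge: claims 0.9 confidence but is right 3 times out of 4.
ece_overconfident = expected_calibration_error([0.9] * 4, [True, True, True, False])
# A calibrated toy judge: claims 0.5 confidence and is right half the time.
ece_calibrated = expected_calibration_error([0.5, 0.5], [True, False])
```

A reliability plot like the one in Figure 2 visualizes the same per-bin gap between confidence and accuracy; bars lying below the diagonal correspond to over-confidence.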

Theorems & Definitions (1)

  • Theorem 1