LLM-as-Judge on a Budget

Aadirupa Saha, Aniket Wagde, Branislav Kveton

TL;DR

A principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities is presented, establishing a theoretical foundation for efficient LLM evaluation with practical implications for AI safety, model alignment, and automated assessment at scale.

Abstract

LLM-as-a-judge has emerged as a cornerstone technique for evaluating large language models by leveraging LLM reasoning to score prompt-response pairs. Since LLM judgments are stochastic, practitioners commonly query each pair multiple times to estimate mean scores accurately. This raises a critical challenge: given a fixed computational budget $B$, how should queries be allocated across $K$ prompt-response pairs to minimize estimation error? We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our method dynamically allocates queries based on estimated score variances, concentrating resources where uncertainty is highest. Further, our algorithm is shown to achieve a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^K \sigma_i^2}{B}}\right)$, where $\sigma_i^2$ is the unknown score variance of pair $i \in [K]$, which corresponds to a near-optimal budget allocation. Experiments on Summarize-From-Feedback and HelpSteer2 demonstrate that our method significantly outperforms uniform allocation, reducing worst-case estimation error while maintaining identical budgets. Our work establishes a theoretical foundation for efficient LLM evaluation with practical implications for AI safety, model alignment, and automated assessment at scale.
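
The idea of allocating queries by estimated score variance can be illustrated with a minimal sketch. It is not the authors' algorithm: it uses a plain plug-in variance estimate with no confidence bonus, whereas the abstract indicates that the actual method relies on concentration inequalities. The names `adaptive_allocate`, `judge`, and `warmup`, and the simulated Gaussian judge, are assumptions made for illustration only.

```python
import random
import statistics


def adaptive_allocate(judge, num_pairs, budget, warmup=2):
    """Greedy variance-adaptive allocation (illustrative sketch only).

    judge(i) is a hypothetical callable returning one noisy score for pair i.
    Returns the list of sampled scores for every pair.
    """
    if warmup < 2:
        raise ValueError("warmup must be at least 2 to estimate a variance")
    if budget < warmup * num_pairs:
        raise ValueError("budget too small for the warm-up phase")
    # Warm-up: query every pair a few times so a variance estimate exists.
    scores = [[judge(i) for _ in range(warmup)] for i in range(num_pairs)]
    for _ in range(budget - warmup * num_pairs):
        # Estimated squared standard error of each pair's mean score.
        sq_err = [statistics.variance(s) / len(s) for s in scores]
        # Spend the next query where the estimated uncertainty is largest.
        i = max(range(num_pairs), key=sq_err.__getitem__)
        scores[i].append(judge(i))
    return scores


# Simulated judge (assumption): Gaussian scores with different noise levels.
rng = random.Random(0)
true_sd = [0.1, 0.5, 2.0]
scores = adaptive_allocate(lambda i: rng.gauss(3.0, true_sd[i]),
                           num_pairs=3, budget=300, warmup=5)
counts = [len(s) for s in scores]
means = [statistics.fmean(s) for s in scores]
```

In this simulation the noisiest pair should receive most of the budget, while a uniform split would give every pair 100 queries regardless of noise. A plug-in estimate can under-sample a pair whose early scores happen to agree, which is one reason a confidence-bound correction on the variance estimate is attractive.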

Paper Structure

This paper contains 27 sections, 13 theorems, 27 equations, 9 figures, 3 tables, 2 algorithms.

Key Result

Lemma 1

Let $\lambda_{i} = \frac{\sigma_i^2}{\sum_{j \in [K]} \sigma_j^2}$. Then ROBIN pulls arm $i$ at least $\lfloor \lambda_{i}B \rfloor$ times and at most $\lceil \lambda_{i}B \rceil$ times, for all $i \in [K]$, i.e., $n_i(\mathcal{R}) \in \{\lfloor \lambda_{i}B \rfloor, \lceil \lambda_{i}B \rceil\}$.
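
To see why this allocation profile matters: if $n_i = \lambda_i B$ exactly, then $\sigma_i^2 / n_i = \sum_{j \in [K]} \sigma_j^2 / B$ for every arm, so all arms share the same standard error, matching the rate quoted in the abstract. The sketch below computes an integer allocation that satisfies the lemma's floor/ceiling property when the variances are known. The largest-remainder rounding rule and the function names `proportional_allocation` and `max_std_error` are assumptions for illustration; the summary does not say how ROBIN itself arrives at its counts.

```python
import math


def proportional_allocation(variances, budget):
    """Split `budget` queries in proportion to known variances.

    Every count lands in {floor(lambda_i * B), ceil(lambda_i * B)} and the
    counts sum to `budget`. Largest-remainder rounding is an assumption made
    here, not necessarily how ROBIN proceeds.
    """
    total = sum(variances)
    if total <= 0:
        raise ValueError("at least one variance must be positive")
    targets = [budget * v / total for v in variances]
    counts = [math.floor(t) for t in targets]
    leftover = budget - sum(counts)
    # Give the remaining queries to the arms with the largest fractional part.
    order = sorted(range(len(variances)),
                   key=lambda i: targets[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def max_std_error(variances, counts):
    """Worst-case standard error of the mean, max_i sqrt(sigma_i^2 / n_i)."""
    return max(math.inf if n == 0 and v > 0 else math.sqrt(v / max(n, 1))
               for v, n in zip(variances, counts))


variances = [1.0, 4.0, 5.0]
adaptive = proportional_allocation(variances, budget=90)
uniform = [30, 30, 30]
err_adaptive = max_std_error(variances, adaptive)
err_uniform = max_std_error(variances, uniform)
```

With these illustrative variances the proportional split equalizes the per-arm standard error at $\sqrt{10/90}$, whereas the uniform split is limited by the noisiest arm at $\sqrt{5/30}$.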

Figures (9)

  • Figure 1: Histogram of score variance of GPT-4.1 nano on helpfulness attribute
  • Figure 2: Histogram of mean scores of GPT-4.1 nano on helpfulness attribute
  • Figure 3: Maximum error, using GPT-4.1 nano, $\delta$=0.007, warm-up period: 20164 samples
  • Figure 4: Maximum error, using GPT-4.1 nano, $\delta$=0.007, warm-up period: 20164 samples
  • Figure 5: Maximum error, using GPT-4.1 nano, $\delta$=0.07, warm-up period: 10807 samples
  • ...and 4 more figures

Theorems & Definitions (23)

  • Definition 1: Optimal Query Allocation for LLM-as-Judge
  • Lemma 1: Allocation Profile of ROBIN
  • Theorem 2: Performance Analysis of ROBIN
  • Proof: Proof Sketch of \ref{thm:known}
  • Theorem 3: Sub-Gaussian Concentration Inequality (lattimore19bandit)
  • Remark 1: Relaxation of the Noise Assumption
  • Theorem 4: Performance Analysis of
  • Proof: Proof Sketch of \ref{thm:unknown}
  • Lemma 4: Estimated Variance Concentration
  • Lemma 4: Allocation Profile of (\ref{alg:algu})
  • ...and 13 more