LLM-as-Judge on a Budget

Aadirupa Saha; Aniket Wagde; Branislav Kveton

TL;DR

A principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities is presented, establishing a theoretical foundation for efficient LLM evaluation with practical implications for AI safety, model alignment, and automated assessment at scale.

Abstract

LLM-as-a-judge has emerged as a cornerstone technique for evaluating large language models by leveraging LLM reasoning to score prompt-response pairs. Since LLM judgments are stochastic, practitioners commonly query each pair multiple times to estimate mean scores accurately. This raises a critical challenge: given a fixed computational budget $B$, how should queries be allocated across $K$ prompt-response pairs to minimize estimation error? We present a principled variance-adaptive approach leveraging multi-armed bandit theory and concentration inequalities. Our method dynamically allocates queries based on estimated score variances, concentrating resources where uncertainty is highest. Further, we show that our algorithm achieves a worst-case score-estimation error of $\tilde{O}\left(\sqrt{\frac{\sum_{i=1}^K \sigma_i^2}{B}}\right)$ with near-optimal budget allocation, where $\sigma_i^2$ is the unknown score variance of pair $i \in [K]$. Experiments on Summarize-From-Feedback and HelpSteer2 demonstrate that our method significantly outperforms uniform allocation, reducing worst-case estimation error under identical budgets. Our work establishes a theoretical foundation for efficient LLM evaluation with practical implications for AI safety, model alignment, and automated assessment at scale.
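To make the variance-adaptive idea concrete, here is a minimal Python sketch. It is not the paper's algorithm: it is a simple greedy plug-in rule that, after a short warm-up, always queries the pair whose sample mean currently has the largest estimated variance $s_i^2 / n_i$. The function name `adaptive_allocation`, the `warmup` parameter, and the simulated Gaussian judges are all assumptions made for illustration; a real judge would be an LLM call.

```python
import random
import statistics


def adaptive_allocation(judges, budget, warmup=5):
    """Greedy plug-in allocation (illustrative, not the paper's method).

    `judges` is a list of zero-argument callables; each call returns one
    stochastic score for its prompt-response pair.
    """
    k = len(judges)
    if warmup < 2 or budget < k * warmup:
        raise ValueError("budget must cover a warm-up of at least 2 queries per pair")
    # Warm-up: a few queries per pair so that sample variances are defined.
    scores = [[judge() for _ in range(warmup)] for judge in judges]
    for _ in range(budget - k * warmup):
        # Estimated variance of each sample mean: s_i^2 / n_i.
        noise = [statistics.variance(s) / len(s) for s in scores]
        # Spend the next query where the mean estimate is noisiest.
        i = max(range(k), key=noise.__getitem__)
        scores[i].append(judges[i]())
    means = [statistics.fmean(s) for s in scores]
    counts = [len(s) for s in scores]
    return means, counts


rng = random.Random(0)
# Hypothetical judges: one noiseless pair and two Gaussian pairs.
judges = [
    lambda: 0.5,
    lambda: rng.gauss(0.3, 0.1),
    lambda: rng.gauss(0.7, 2.0),
]
means, counts = adaptive_allocation(judges, budget=200)
```

In this toy run the noiseless pair receives only its warm-up queries, and the rest of the budget goes to the pairs whose scores actually vary. A plug-in rule like this can be misled by a poor early variance estimate, which is one reason a principled method would rely on concentration inequalities rather than raw sample variances.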

Paper Structure (27 sections, 13 theorems, 27 equations, 9 figures, 3 tables, 2 algorithms)

Introduction
Problem Setup
Problem Setting.
Budget Constraint Optimal Allocation
Objective and Performance Metric
Warm-Up: Optimal Allocation with Known Variance
ROBIN: Algorithm Description
Performance Analysis of ROBIN
Main Algorithm: Near-Optimal Allocation with Unknown Variance
ROBIN-HOOD: Algorithm Description
Performance Analysis of ROBIN-HOOD
Experiments
Experimental Setup
Inference from the Empirical Results
Conclusions
...and 12 more sections

Key Result

Lemma 1

Let $\lambda_i = \frac{\sigma_i^2}{\sum_{j \in [K]} \sigma_j^2}$. Then ROBIN pulls Arm-$i$ at least $\lfloor \lambda_i B \rfloor$ times and at most $\lceil \lambda_i B \rceil$ times, for all $i \in [K]$, i.e. $n_i(\mathcal{R}) \in \{\lfloor \lambda_i B \rfloor, \lceil \lambda_i B \rceil\}$.
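The allocation profile in the lemma can be illustrated with a short Python sketch. This is not ROBIN itself, whose description is not reproduced on this page; it is a largest-remainder rounding of the shares $\lambda_i B$ that lands every count between $\lfloor \lambda_i B \rfloor$ and $\lceil \lambda_i B \rceil$ while spending exactly $B$ queries. The function name `proportional_allocation` and the example variances are hypothetical.

```python
import math


def proportional_allocation(variances, budget):
    """Split `budget` queries across arms in proportion to known variances."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("at least one variance must be positive")
    # Real-valued shares lambda_i * B.
    shares = [v * budget / total for v in variances]
    counts = [math.floor(s) for s in shares]
    leftover = budget - sum(counts)
    # Give the remaining queries to the arms with the largest fractional parts,
    # so each count ends at either floor(share) or ceil(share).
    order = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical variances; shares are 1.25, 5.0, 2.5, 1.25 for B = 10.
allocation = proportional_allocation([1.0, 4.0, 2.0, 1.0], budget=10)
```

With an exactly proportional split $n_i = \lambda_i B$, the variance of each sample mean is $\sigma_i^2 / n_i = \sum_{j} \sigma_j^2 / B$, the same for every pair, which matches the form of the error bound $\tilde{O}\left(\sqrt{\sum_{i} \sigma_i^2 / B}\right)$ quoted in the abstract.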

Figures (9)

Figure 1: Histogram of score variance of GPT-4.1 nano on helpfulness attribute
Figure 2: Histogram of mean scores of GPT-4.1 nano on helpfulness attribute
Figure 3: Maximum error, using GPT-4.1 nano, $\delta$=0.007, warm-up period: 20164 samples
Figure 4: Maximum error, using GPT-4.1 nano, $\delta$=0.007, warm-up period: 20164 samples
Figure 5: Maximum error, using GPT-4.1 nano, $\delta$=0.07, warm-up period: 10807 samples
...and 4 more figures

Theorems & Definitions (23)

Definition 1: Optimal Query Allocation for LLM-as-Judge
Lemma 1: Allocation Profile of ROBIN
Theorem 2: Performance Analysis of ROBIN
Proof: Proof Sketch of Theorem 2
Theorem 3: Sub-Gaussian Concentration Inequality (lattimore19bandit)
Remark 1: Relaxation of the Noise Assumption
Theorem 4: Performance Analysis of ROBIN-HOOD
Proof: Proof Sketch of Theorem 4
Lemma 4: Estimated Variance Concentration
Lemma 4: Allocation Profile of ROBIN-HOOD
...and 13 more
