When LLM Judge Scores Look Good but Best-of-N Decisions Fail

Eddie Landesberg

Abstract

Large language models are often used as judges to score candidate responses, then validated with a single global metric such as correlation with reference labels. This can be misleading when the real deployment task is best-of-n selection within a prompt. In a 5,000-prompt best-of-4 benchmark from Chatbot Arena, a judge with moderate global correlation (r = 0.47) captures only 21.0% of the improvement that perfect selection would achieve over random choice. The gap arises because global agreement is driven largely by prompt-level baseline effects, while selection depends on within-prompt ranking: within-prompt correlation is only r_within = 0.27, and coarse pointwise scoring creates ties in 67% of pairwise comparisons. In a matched-pair best-of-2 audit, explicit pairwise judging recovers much of this lost signal, raising recovery from 21.1% to 61.2%. For judge-based selection, the relevant audit should report within-prompt signal, tie rates, and recovery/top-1 accuracy, not global agreement alone.
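The recovery and top-1 quantities named in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `selection_audit`, the nested-list input layout, and the choice to break judge ties uniformly at random (scored here in expectation) are all assumptions made for the example.

```python
from statistics import mean


def selection_audit(judge, oracle):
    """Audit best-of-n selection by a judge against oracle labels.

    judge, oracle: lists of per-prompt score lists with identical shapes.
    Assumption: when several candidates share the top judge score, one is
    picked uniformly at random, so we score the expected outcome.
    """
    picked, best, rand, hits = [], [], [], []
    for s, o in zip(judge, oracle):
        top = max(s)
        # Oracle values of every candidate tied for the top judge score.
        tied = [o_i for s_i, o_i in zip(s, o) if s_i == top]
        picked.append(mean(tied))   # expected oracle value of the pick
        best.append(max(o))         # perfect selection
        rand.append(mean(o))        # random choice baseline
        # Probability that the pick is an oracle-best candidate.
        hits.append(sum(o_i == max(o) for o_i in tied) / len(tied))
    gain = mean(best) - mean(rand)
    # Recovery: share of the perfect-over-random improvement that is captured.
    recovery = (mean(picked) - mean(rand)) / gain if gain > 0 else float("nan")
    return {"recovery": recovery, "top1": mean(hits)}


# Hypothetical toy data: two prompts with four candidates each.
judge_scores = [[1, 1, 2, 0], [3, 3, 3, 3]]
oracle_scores = [[0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0]]
audit = selection_audit(judge_scores, oracle_scores)
print(audit)
```

In the second toy prompt the judge assigns the same score to all four candidates, so its pick is no better than random; this is the tie mechanism the abstract points to.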

Paper Structure (128 sections, 8 theorems, 28 equations, 4 figures, 29 tables)

Introduction
Scope.
Contributions.
Setting and Task
Metrics: From "Looks Correlated" to "Decision Useful"
Headline Metric (What People Usually Report)
Global correlation.
Decision Validity (What Optimization Needs)
Top-1 accuracy (PCS$_n$).
Recovery rate.
Pairwise sign agreement.
Within-prompt Kendall $\tau$.
Supporting Tie Diagnostics
Why tie-aware metrics matter.
Within-Between Decomposition
...and 113 more sections

Key Result

Proposition 1

For any global correlation $r \in (0, 1)$ and any recovery targets $\rho_1, \rho_2 \in [0, 1]$ with $\rho_1 < \rho_2$, there exist data-generating processes $P_1$ and $P_2$ such that:

Figures (4)

Figure 1: Minimal two-path picture. Blue solid arrows are the context-level baseline path ($D_x\!\rightarrow\!S$, $D_x\!\rightarrow\!O$), which is mostly prompt-level in this dataset. The green solid arrow is the oracle quality link ($U_{x,i}\!\rightarrow\!O$). The green dashed arrow is the weaker judge quality link ($U_{x,i}\!\rightarrow\!S$), attenuated by noise and score quantization. Line style here is conceptual (relative signal strength in this setting), not a statistical significance marker. Global correlation uses both paths. Best-of-$n$ depends on the quality path; when $U_{x,i}\!\rightarrow\!S$ is weak, ranking decisions can fail even if global correlation looks acceptable.
Figure 2: Distribution of pairwise score differences. Left: Judge differences ($\Delta S$) show 66.5% ties due to 20-bin discretization. Right: Oracle differences ($\Delta O$) show 16.1% ties. The judge's coarse resolution is the primary bottleneck for directional decisions.
Figure 3: Candidate-similarity sensitivity analysis. (A) As trivially distinguishable pairs are added to evaluation, global $r$ can increase from 0.47 to as high as 0.89 while hard-regime performance is unchanged. (B) Global $r$ and sign agreement move differently under this mix shift: adding easy pairs increases both metrics, but sign agreement rises much more (+30pp) than $r$ (+0.42).
Figure 4: What judge quality is needed for $X$% recovery? The curve shows required $p_{\text{eff}}$ and $r_{\text{within}}$ as a function of target recovery rate. Current judge ($r_{\text{within}} = 0.27$) achieves 21.0% recovery; reaching 50% recovery would require $r_{\text{within}} \approx 0.42$.
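
The two-path picture in Figure 1 and the tie effect in Figure 2 can be reproduced qualitatively with a small simulation. The sketch below is illustrative only: the Gaussian data-generating process, the path weights (1.0 for the baseline, 0.2 for the judge quality link), the noise scales, and the equal-width 20-level binning on [-4, 4] are assumptions chosen for the example and are not taken from the paper.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flatten(groups):
    return [v for g in groups for v in g]


def demean(groups):
    """Subtract each prompt's mean, removing the prompt-level baseline."""
    out = []
    for g in groups:
        m = sum(g) / len(g)
        out.extend(v - m for v in g)
    return out


def quantize(score, bins=20, lo=-4.0, hi=4.0):
    """Map a score to one of `bins` equal-width levels on [lo, hi] (assumed)."""
    clipped = min(max(score, lo), hi)
    return min(int((clipped - lo) / (hi - lo) * bins), bins - 1)


def tie_rate(groups):
    """Fraction of within-prompt candidate pairs with identical scores."""
    pairs = [(a, b) for g in groups for a, b in combinations(g, 2)]
    return sum(a == b for a, b in pairs) / len(pairs)


def simulate(n_prompts=2000, n=4, seed=0):
    rng = random.Random(seed)
    judge, oracle = [], []
    for _ in range(n_prompts):
        d = rng.gauss(0, 1.0)                       # prompt baseline D_x
        u = [rng.gauss(0, 0.5) for _ in range(n)]   # candidate quality U_{x,i}
        oracle.append([d + ui for ui in u])         # O uses both paths fully
        # S shares the baseline but has a weak, noisy quality link.
        judge.append([d + 0.2 * ui + rng.gauss(0, 0.5) for ui in u])
    return judge, oracle


judge, oracle = simulate()
global_r = pearson(flatten(judge), flatten(oracle))
within_r = pearson(demean(judge), demean(oracle))
binned = [[quantize(s) for s in g] for g in judge]
print(f"global r = {global_r:.2f}, within-prompt r = {within_r:.2f}")
print(f"tie rate raw = {tie_rate(judge):.3f}, binned = {tie_rate(binned):.3f}")
```

Under these assumed settings the shared baseline inflates the global correlation well above the within-prompt correlation, and binning the judge score introduces pairwise ties that the continuous score does not have.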

Theorems & Definitions (16)

Proposition 1: Non-Identifiability
proof : Proof sketch
Proposition 2: Optimal Routing
proof
Corollary 3: Routing Bound
Definition 1: Confidence Calibration
Definition 2: VOI Calibration
Proposition 4: Calibration Gap
proof
Definition 3: Level Validity
...and 6 more
