
When LLM Judge Scores Look Good but Best-of-N Decisions Fail

Eddie Landesberg

Abstract

Large language models are often used as judges to score candidate responses, then validated with a single global metric such as correlation with reference labels. This can be misleading when the real deployment task is best-of-n selection within a prompt. In a 5,000-prompt best-of-4 benchmark from Chatbot Arena, a judge with moderate global correlation (r = 0.47) captures only 21.0% of the improvement that perfect selection would achieve over random choice. The gap arises because global agreement is driven largely by prompt-level baseline effects, while selection depends on within-prompt ranking: within-prompt correlation is only r_within = 0.27, and coarse pointwise scoring creates ties in 67% of pairwise comparisons. In a matched-pair best-of-2 audit, explicit pairwise judging recovers much of this lost signal, raising recovery from 21.1% to 61.2%. For judge-based selection, the relevant audit should report within-prompt signal, tie rates, and recovery/top-1 accuracy, not global agreement alone.
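The audit quantities named in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `selection_audit`, the array layout (one row per prompt, one column per candidate), the choice to define within-prompt correlation as the pooled correlation of prompt-demeaned scores, and the uniform tie-breaking rule are all assumptions made here.

```python
import numpy as np

def selection_audit(scores, oracle):
    """Hypothetical audit helper.

    scores, oracle: arrays of shape (n_prompts, n_candidates) holding
    judge scores and reference (oracle) values for the same candidates.
    """
    S = np.asarray(scores, dtype=float)
    O = np.asarray(oracle, dtype=float)

    # Global correlation pools every (prompt, candidate) cell, so it
    # also rewards agreement on prompt-level baselines.
    global_r = np.corrcoef(S.ravel(), O.ravel())[0, 1]

    # Within-prompt correlation (assumed definition): subtract each
    # prompt's mean first, which removes the prompt-level baseline.
    Sc = S - S.mean(axis=1, keepdims=True)
    Oc = O - O.mean(axis=1, keepdims=True)
    within_r = np.corrcoef(Sc.ravel(), Oc.ravel())[0, 1]

    # Share of within-prompt candidate pairs the judge scores equally.
    i, j = np.triu_indices(S.shape[1], k=1)
    tie_rate = float(np.mean(S[:, i] == S[:, j]))

    # Best-of-n value under the judge: pick the top-scored candidate,
    # averaging over tied top candidates (uniform tie-breaking).
    top = S == S.max(axis=1, keepdims=True)
    judge_value = (O * top).sum(axis=1) / top.sum(axis=1)
    random_value = O.mean(axis=1)   # expected value of a random pick
    oracle_value = O.max(axis=1)    # value of perfect selection

    # Recovery: fraction of the oracle-over-random gain the judge captures.
    recovery = (judge_value.mean() - random_value.mean()) / (
        oracle_value.mean() - random_value.mean()
    )
    return {
        "global_r": float(global_r),
        "within_r": float(within_r),
        "tie_rate": tie_rate,
        "recovery": float(recovery),
    }

# Toy input: two prompts, two candidates each; the second prompt is a judge tie.
metrics = selection_audit([[1, 2], [3, 3]], [[0, 1], [1, 0]])
```

Reporting all four numbers side by side makes it visible when a respectable `global_r` coexists with a low `within_r`, a high `tie_rate`, or low `recovery`.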

Paper Structure (128 sections, 8 theorems, 28 equations, 4 figures, 29 tables)


Key Result

Proposition 1

For any global correlation $r \in (0, 1)$ and any recovery targets $\rho_1, \rho_2 \in [0, 1]$ with $\rho_1 < \rho_2$, there exist data-generating processes $P_1$ and $P_2$ such that:

Figures (4)

  • Figure 1: Minimal two-path picture. Blue solid arrows are the context-level baseline path ($D_x\!\rightarrow\!S$, $D_x\!\rightarrow\!O$), which is mostly prompt-level in this dataset. The green solid arrow is the oracle quality link ($U_{x,i}\!\rightarrow\!O$). The green dashed arrow is the weaker judge quality link ($U_{x,i}\!\rightarrow\!S$), attenuated by noise and score quantization. Line style here is conceptual (relative signal strength in this setting), not a statistical significance marker. Global correlation uses both paths. Best-of-$n$ depends on the quality path; when $U_{x,i}\!\rightarrow\!S$ is weak, ranking decisions can fail even if global correlation looks acceptable.
  • Figure 2: Distribution of pairwise score differences. Left: Judge differences ($\Delta S$) show 66.5% ties due to 20-bin discretization. Right: Oracle differences ($\Delta O$) show 16.1% ties. The judge's coarse resolution is the primary bottleneck for directional decisions.
  • Figure 3: Candidate-similarity sensitivity analysis. (A) As trivially distinguishable pairs are added to evaluation, global $r$ can increase from 0.47 to as high as 0.89 while hard-regime performance is unchanged. (B) Global $r$ and sign agreement move differently under this mix shift: adding easy pairs increases both metrics, but sign agreement rises much more (+30pp) than $r$ (+0.42).
  • Figure 4: What judge quality is needed for $X$% recovery? The curve shows required $p_{\text{eff}}$ and $r_{\text{within}}$ as a function of target recovery rate. Current judge ($r_{\text{within}} = 0.27$) achieves 21.0% recovery; reaching 50% recovery would require $r_{\text{within}} \approx 0.42$.
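The two-path picture in Figure 1 can be illustrated with a small synthetic simulation. This is a sketch under stated assumptions, not the paper's data-generating process: the Gaussian components, the 0.3 coefficient on the judge quality link, and the unit noise scale are hypothetical values chosen only to make the effect visible. Score quantization is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_candidates = 5000, 4

# Hypothetical components, all standard normal.
D = rng.normal(size=(n_prompts, 1))                 # prompt-level baseline D_x
U = rng.normal(size=(n_prompts, n_candidates))      # candidate quality U_{x,i}
noise = rng.normal(size=(n_prompts, n_candidates))  # judge noise

O = D + U                # oracle: baseline path plus full quality link
S = D + 0.3 * U + noise  # judge: baseline path plus a weak quality link

def corr(a, b):
    """Pearson correlation over all cells of two equally shaped arrays."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

def demean(a):
    """Remove each prompt's mean, leaving only within-prompt variation."""
    return a - a.mean(axis=1, keepdims=True)

global_r = corr(S, O)                  # uses both paths
within_r = corr(demean(S), demean(O))  # uses only the quality path

# Population values under these assumed coefficients:
#   global r = 1.3 / sqrt(2 * 2.09), about 0.64
#   within r = 0.3 / sqrt(1.09),     about 0.29
print(f"global r = {global_r:.2f}, within-prompt r = {within_r:.2f}")
```

Because the shared baseline $D_x$ enters both $S$ and $O$, it inflates the pooled correlation without helping the judge rank candidates inside a prompt, which is the quantity best-of-$n$ selection depends on.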

Theorems & Definitions (16)

  • Proposition 1: Non-Identifiability
  • proof: Proof sketch
  • Proposition 2: Optimal Routing
  • proof
  • Corollary 3: Routing Bound
  • Definition 1: Confidence Calibration
  • Definition 2: VOI Calibration
  • Proposition 4: Calibration Gap
  • proof
  • Definition 3: Level Validity
  • ...and 6 more