Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems

Eddie Landesberg

TL;DR

This work addresses the evaluation of long-horizon LLM outcomes when oracle labels are expensive: it calibrates cheap judge scores against a small oracle slice, then evaluates at scale with auditable uncertainty. The paper introduces Causal Judge Evaluation (CJE), which combines AutoCal-R for mean-preserving reward calibration, SIMCal-W for weight stabilization, and Oracle-Uncertainty-Aware (OUA) inference to propagate calibration uncertainty, all within a Design-by-Projection framework grounded in semiparametric efficiency. The approach yields near-nominal CI coverage and high ranking accuracy on a large Arena benchmark, and it diagnoses why standard off-policy evaluation (OPE) can fail under limited overlap (the Coverage-Limited Efficiency, or CLE, phenomenon). A key contribution is the policy-wise mean-transport test, which makes transportability auditable rather than assumed, so that calibration can be reused safely across policies and contexts. In practice, CJE enables accurate, cost-effective, and auditable evaluation of diverse LLM policies at production scale, with diagnostics that indicate when to collect more data or recalibrate.

Abstract

Measuring long-run LLM outcomes (user satisfaction, expert judgment, downstream KPIs) is expensive. Teams default to cheap LLM judges, but uncalibrated proxies can invert rankings entirely. Causal Judge Evaluation (CJE) makes it affordable to aim at the right target: calibrate cheap scores against 5% oracle labels, then evaluate at scale with valid uncertainty. On 4,961 Arena prompts, CJE achieves 99% ranking accuracy at 14x lower cost. Key findings: naive confidence intervals on uncalibrated scores achieve 0% coverage (CJE: ~95%); importance-weighted estimators fail despite 90%+ effective sample size. We introduce the Coverage-Limited Efficiency (CLE) diagnostic explaining why. CJE combines mean-preserving calibration (AutoCal-R), weight stabilization (SIMCal-W), and bootstrap inference that propagates calibration uncertainty (OUA), grounded in semiparametric efficiency theory.
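The calibrate-then-evaluate idea in the abstract can be illustrated with a minimal sketch. It uses plain isotonic regression (pool-adjacent-violators) as a stand-in for the calibration step; this is not the paper's AutoCal-R implementation, and the function names `fit_isotonic` and `predict_calibrated` as well as the toy data are assumptions made for illustration only.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Least-squares non-decreasing fit of oracle labels on judge scores (PAVA).

    Hypothetical helper: returns (uppers, means), a step function where
    means[i] applies to scores up to and including uppers[i].
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("need equally sized, non-empty scores and labels")
    # Pool tied scores first so that one score maps to exactly one fitted value.
    pooled = {}
    for s, y in zip(scores, labels):
        total, count = pooled.get(s, (0.0, 0))
        pooled[s] = (total + y, count + 1)
    blocks = []  # each block: [sum of labels, count, largest score in block]
    for s in sorted(pooled):
        total, count = pooled[s]
        blocks.append([total, count, s])
        # Merge backwards while the fitted means would decrease.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            t, c, upper = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    means = [b[0] / b[1] for b in blocks]
    return uppers, means


def predict_calibrated(uppers, means, score):
    """Map a raw judge score to the calibrated (oracle-scale) value."""
    idx = min(bisect_left(uppers, score), len(means) - 1)
    return means[idx]


# Toy oracle slice: judge scores paired with expensive oracle labels (made up).
oracle_scores = [0.1, 0.3, 0.5, 0.7, 0.9]
oracle_labels = [0.0, 0.4, 0.2, 0.8, 1.0]
uppers, means = fit_isotonic(oracle_scores, oracle_labels)

# Bulk data carries judge scores only; the policy value estimate is the
# average calibrated score.
bulk_scores = [0.2, 0.4, 0.6, 0.95]
calibrated = [predict_calibrated(uppers, means, s) for s in bulk_scores]
print(sum(calibrated) / len(calibrated))
```

On the oracle slice itself, the fitted values average to the mean of the oracle labels, which is the mean-preservation property of a least-squares isotonic fit. The sketch does not propagate calibration uncertainty into a confidence interval, which is the role the abstract assigns to OUA inference.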

Paper Structure

This paper contains 173 sections, 19 theorems, 63 equations, 9 figures, 10 tables, 8 algorithms.

Key Result

Theorem 1

Let $\phi_{\mathrm{uncon}}$ be the canonical gradient in the nonparametric model that does not use $S$. Then $\phi_{\mathrm{sur}}$ is the canonical gradient in the surrogate model, and $\operatorname{Var}(\phi_{\mathrm{sur}})\le \operatorname{Var}(\phi_{\mathrm{uncon}})$, with strict inequality unle

Figures (9)

  • Figure 1: Judge-as-surrogate under policy and environment shift. $X$: prompt; $A$: response; $S$: judge score; $Y$: oracle label; $\pi$: policy; $E$: environment. $S$ and $Y$ are parallel measurements of the response. Solid arrows: causal structure ($\pi \to A$: policy determines response). Dashed red arrows: mechanism-shift edges indicating the $S$-$Y$ calibration may differ across policies or environments; these are not causal effects of $\pi$ or $E$ on $Y$ holding $(X,A)$ fixed. CJE tests whether calibration transports via $\mathbb{E}[Y - f(S,X)] = 0$.
  • Figure 2: CJE pipeline overview. A small oracle slice (5–25%) provides expensive oracle labels to train a calibration model ($S \to Y$). The learned mapping is then applied to bulk evaluation data where oracle labels are unavailable, enabling policy evaluation at a fraction of the cost. (In experiments, the oracle is gpt-5-2025-08-07; in production this would typically be human raters or downstream KPIs.)
  • Figure 3: AutoCal-R: monotone vs. two-stage calibration. Left: standard isotonic regression enforces monotonicity but cannot capture non-monotonic patterns in $\mathbb{E}[Y \mid S]$. Right: two-stage calibration (spline index $\to$ isotonic) can fit flexible patterns while preserving the mean. AutoCal-R automatically selects the mode via cross-validation.
  • Figure 4: SIMCal-W weight stabilization across Arena policies ($n{=}4{,}961$ samples with complete logprobs for all four target policies). Raw importance weights (blue dots) span $10^{-130}$ to $10^{2}$; $S$-monotone projection (green line) stabilizes weights while preserving the unit mean. ESS improvements range from 4.6$\times$ (clone) to $>$3000$\times$ (parallel_universe_prompt). The premium policy shows weights spanning 130 orders of magnitude before stabilization. Note: \ref{tab:weight-diagnostics} reports ablation-averaged ESS across experimental conditions.
  • Figure 5: CJE output: policy value estimates at $n{=}1000$, 25% oracle. Red diamonds show oracle ground truth; blue circles show CJE estimates with 95% CIs. CIs capture the true value for policies satisfying transportability; the unhelpful policy, which violates transportability, shows slight miscoverage, a failure mode flagged by the transportability test (\ref{fig:transportability-test}). This is representative of what a practitioner would see when applying CJE to their own evaluation data.
  • ...and 4 more figures
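The ESS figures quoted for Figure 4 can be made concrete with a small sketch of the standard Kish effective sample size, $\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$. Whether the paper uses exactly this definition is an assumption; the helper names and the toy weights below are hypothetical.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0.0:
        raise ValueError("all weights are zero")
    return total * total / total_sq


def normalize_to_unit_mean(weights):
    """Rescale weights so they average to 1 (ESS is unchanged by rescaling)."""
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]


# Uniform weights: every sample counts equally.
uniform = [1.0, 1.0, 1.0, 1.0]
# Heavy-tailed weights: a single sample dominates the estimate.
skewed = [1e-6, 1e-6, 1e-6, 100.0]

for name, w in [("uniform", uniform), ("skewed", skewed)]:
    w = normalize_to_unit_mean(w)
    ess = effective_sample_size(w)
    print(name, "ESS fraction:", ess / len(w))
```

ESS summarizes only the dispersion of the weights. The abstract reports importance-weighted estimators failing despite 90%+ effective sample size, so a high ESS should be read as necessary rather than sufficient for a reliable estimate.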

Theorems & Definitions (22)

  • Theorem 1: Surrogate EIF and variance reduction
  • Theorem 2: Efficiency via Model Restriction
  • Corollary 1: Blackwell–efficiency monotonicity
  • Proposition 1: Cal-IPS: mean correctness and dispersion control
  • Theorem 3: DR-CPO: $\sqrt{n}$ limits and efficiency
  • Theorem 4: Budgeted information bound
  • Theorem 5: IF-space stacking
  • Corollary 2: Carathéodory sparsity
  • Proposition 2: OUA jackknife variance estimation
  • Proposition 3: Mean transport equivalence
  • ...and 12 more
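The mean-transport condition from Figure 1, $\mathbb{E}[Y - f(S,X)] = 0$, suggests a simple per-policy residual check. The sketch below is one rough way to run it, assuming a small audit sample of oracle labels for the target policy and a normal-approximation 95% interval; it is not the paper's test procedure, and `mean_transport_check` and the toy numbers are hypothetical.

```python
import math
import statistics


def mean_transport_check(oracle_labels, calibrated_scores, z_crit=1.96):
    """Check whether the mean residual Y - f(S, X) is compatible with zero.

    Assumption: a normal-approximation interval with z_crit = 1.96; for very
    small audit samples a t-based interval would be wider.
    """
    residuals = [y - f for y, f in zip(oracle_labels, calibrated_scores)]
    n = len(residuals)
    if n < 2:
        raise ValueError("need at least two audit samples")
    mean_resid = statistics.fmean(residuals)
    std_err = statistics.stdev(residuals) / math.sqrt(n)
    lower = mean_resid - z_crit * std_err
    upper = mean_resid + z_crit * std_err
    return {
        "mean_residual": mean_resid,
        "ci": (lower, upper),
        "passes": lower <= 0.0 <= upper,
    }


# Calibrated judge scores f(S, X) on an audit sample for one policy (made up).
calibrated = [0.62, 0.68, 0.52, 0.79, 0.66, 0.73]
# Policy A: oracle labels scatter around the calibrated scores.
labels_a = [0.60, 0.70, 0.50, 0.80, 0.65, 0.75]
# Policy B: oracle labels sit systematically below the calibrated scores.
labels_b = [y - 0.3 for y in labels_a]

print(mean_transport_check(labels_a, calibrated)["passes"])
print(mean_transport_check(labels_b, calibrated)["passes"])
```

A failed check for a policy indicates that the calibration learned elsewhere should not be reused for that policy without recalibration, which matches the way Figure 5 describes the unhelpful policy being flagged.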