Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems

Eddie Landesberg

TL;DR

This work addresses the evaluation of long-horizon LLM outcomes when oracle labels are expensive: cheap judge scores are calibrated against a small oracle slice and then used to evaluate at scale with auditable uncertainty. The paper introduces Causal Judge Evaluation (CJE), which combines AutoCal-R for mean-preserving reward calibration, SIMCal-W for weight stabilization, and Oracle-Uncertainty-Aware (OUA) inference to propagate calibration uncertainty, all within a Design-by-Projection framework grounded in semiparametric efficiency. The approach yields near-nominal confidence-interval (CI) coverage and high ranking accuracy on a large Arena benchmark, and it diagnoses why standard off-policy evaluation (OPE) can fail under limited overlap (the Coverage-Limited Efficiency, or CLE, phenomenon). A key contribution is the policy-wise mean-transport test, which makes transportability auditable rather than assumed and so enables safe reuse of a calibration across policies and contexts. In practice, CJE enables accurate, cost-effective, and auditable evaluation of diverse LLM policies at production scale, with diagnostics that guide data collection and recalibration when needed.

Abstract

Measuring long-run LLM outcomes (user satisfaction, expert judgment, downstream KPIs) is expensive. Teams default to cheap LLM judges, but uncalibrated proxies can invert rankings entirely. Causal Judge Evaluation (CJE) makes it affordable to aim at the right target: calibrate cheap scores against 5% oracle labels, then evaluate at scale with valid uncertainty. On 4,961 Arena prompts, CJE achieves 99% ranking accuracy at 14x lower cost. Key findings: naive confidence intervals on uncalibrated scores achieve 0% coverage (CJE: ~95%); importance-weighted estimators fail despite 90%+ effective sample size. We introduce the Coverage-Limited Efficiency (CLE) diagnostic explaining why. CJE combines mean-preserving calibration (AutoCal-R), weight stabilization (SIMCal-W), and bootstrap inference that propagates calibration uncertainty (OUA), grounded in semiparametric efficiency theory.
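
The calibrate-then-evaluate idea in the abstract can be illustrated with a small sketch. This is not the paper's AutoCal-R implementation, whose details are not given in this section. It assumes plain isotonic regression (pool-adjacent-violators) as a stand-in for a mean-preserving monotone calibrator, and it uses synthetic data in which the judge score overstates the true outcome. The names `fit_isotonic`, `predict`, and all simulation parameters are hypothetical.

```python
import random
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Fit a nondecreasing step function by pool-adjacent-violators (PAVA).

    Stand-in calibrator (an assumption, not the paper's AutoCal-R).
    Returns (uppers, values): block upper score bounds and block means.
    """
    blocks = []  # each block: [label_sum, count, max_score]
    for s, y in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == s:
            # Tied scores must share one fitted value, so pool them.
            blocks[-1][0] += y
            blocks[-1][1] += 1
        else:
            blocks.append([y, 1, s])
        # Merge backwards while block means violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            label_sum, count, max_score = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = max_score
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def predict(model, score):
    """Map a judge score to the fitted value of the block that contains it."""
    uppers, values = model
    idx = min(bisect_left(uppers, score), len(values) - 1)
    return values[idx]


# Synthetic setup (all numbers are illustrative assumptions).
rng = random.Random(0)
n_total, n_oracle = 2000, 100  # a 5% oracle slice
judge = [rng.random() for _ in range(n_total)]
# The judge is miscalibrated: the true outcome is roughly score squared.
outcome = [s ** 2 + rng.gauss(0.0, 0.1) for s in judge]

oracle_idx = rng.sample(range(n_total), n_oracle)
oracle_scores = [judge[i] for i in oracle_idx]
oracle_labels = [outcome[i] for i in oracle_idx]

model = fit_isotonic(oracle_scores, oracle_labels)
calibrated = [predict(model, s) for s in judge]

raw_estimate = sum(judge) / n_total
calibrated_estimate = sum(calibrated) / n_total
truth = sum(outcome) / n_total

print("raw judge mean:      ", round(raw_estimate, 3))
print("calibrated estimate: ", round(calibrated_estimate, 3))
print("full-sample outcome: ", round(truth, 3))
```

On the oracle slice, the fitted values average exactly to the oracle labels, which is the sense in which isotonic regression is mean-preserving. The sketch reports a point estimate only; it does not propagate the uncertainty from fitting the calibrator on 100 labels, which is the role the abstract assigns to OUA inference.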
