Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas

Luke Guerdan, Justin Whitehouse, Kimberly Truong, Kenneth Holstein, Zhiwei Steven Wu

TL;DR

This work proposes a doubly-robust estimation framework that provides a principled foundation for combining imperfect persona ratings with human ratings observed under sampling bias to obtain valid system quality estimates.

Abstract

As Generative AI (GenAI) systems see growing adoption, a key concern involves the external validity of evaluations, or the extent to which they generalize from lab-based to real-world deployment conditions. Threats to the external validity of GenAI evaluations arise when the source sample of human raters and system outputs used to obtain a system quality estimate differs from the target distribution at deployment time. In this work, we propose a doubly-robust estimation framework designed to address this evaluation sampling bias. Key to our approach is the use of "persona" ratings produced by prompting an LLM evaluator (i.e., an LLM-as-a-judge) to behave as a human rater with specific sociodemographic characteristics. Our doubly-robust framework combines these informative yet imperfect persona ratings with human ratings obtained under evaluation sampling bias to produce statistically valid system quality estimates. In particular, we show that our approach yields valid system quality estimates when either (i) a model trained to predict human ratings using persona ratings and source data observed under sampling bias, or (ii) a reweighting model that corrects for sampling bias is of sufficient quality. We validate our framework theoretically and via a novel Persona Simulation Framework (PSF) designed to systematically manipulate persona quality and the degree of evaluation sampling bias present in source data. Our work provides a principled foundation for combining imperfect persona ratings with human ratings observed under sampling bias to obtain valid system quality estimates.
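The "either (i) or (ii)" property described in the abstract can be illustrated with a textbook doubly-robust mean estimator under covariate shift. The sketch below is not the paper's Riesz-based estimator. It is a minimal illustration under stated assumptions: a single binary rater covariate, a persona rating used directly as the (deliberately misspecified) outcome model, and a known density ratio standing in for the reweighting model. The names `doubly_robust_mean` and `simulate`, and all numeric settings, are hypothetical.

```python
import math
import random
import statistics


def doubly_robust_mean(y_src, pred_src, w_src, pred_tgt, z=1.96):
    """Textbook doubly-robust estimate of a target-population mean rating.

    y_src    : human ratings observed on source samples
    pred_src : outcome-model predictions on the same source samples
    w_src    : density-ratio weights p_t(x) / p_s(x) for the source samples
    pred_tgt : outcome-model predictions on unlabeled target samples
    """
    # Plug-in term: average predicted rating over the target sample.
    plug_in = statistics.fmean(pred_tgt)
    # Correction term: reweighted residuals of the outcome model on source data.
    resid = [w * (y - m) for w, y, m in zip(w_src, y_src, pred_src)]
    theta = plug_in + statistics.fmean(resid)
    # Normal-approximation interval, treating source and target samples as independent.
    var = (statistics.variance(pred_tgt) / len(pred_tgt)
           + statistics.variance(resid) / len(resid))
    half = z * math.sqrt(var)
    return theta, (theta - half, theta + half)


def simulate(n, p_group1, rng):
    """Hypothetical toy data: binary covariate, human rating, biased persona rating."""
    xs = [1 if rng.random() < p_group1 else 0 for _ in range(n)]
    human = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    persona = [0.5 + 1.0 * x for x in xs]  # informative but systematically off
    return xs, human, persona


rng = random.Random(0)
P_SRC, P_TGT = 0.2, 0.8  # share of group 1 in source vs. target (covariate shift)
x_s, y_s, persona_s = simulate(5000, P_SRC, rng)
x_t, _, persona_t = simulate(5000, P_TGT, rng)  # target human ratings are never used

# Assumed-known density ratio for the toy shift; in practice it would be estimated.
w_s = [P_TGT / P_SRC if x == 1 else (1 - P_TGT) / (1 - P_SRC) for x in x_s]

theta_dr, ci_dr = doubly_robust_mean(y_s, persona_s, w_s, persona_t)
theta_naive = statistics.fmean(y_s)            # source-only sample average
theta_persona = statistics.fmean(persona_t)    # persona ratings averaged on target
print(theta_dr, ci_dr, theta_naive, theta_persona)
```

In this toy setup the true target mean is 1 + 2 × 0.8 = 2.6. The persona ratings alone are biased and the source sample average ignores the shift, but the reweighted residual term corrects the persona-based plug-in because the weights are correct. Swapping in a correct outcome model with incorrect weights would exercise the other half of the double-robustness property.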

Paper Structure

This paper contains 38 sections, 9 theorems, 66 equations, 14 figures, 3 tables, 3 algorithms.

Key Result

Theorem 3.1

Assume the learner has access to samples $Z_1^s, \dots, Z_{N_s}^s \sim P_s$ and $Z_1^t, \dots, Z_{N_t}^t \sim P_t$ satisfying Assumptions \ref{ass:full}–\ref{ass:observed} and the assumptions of Theorems \ref{thm:no-split-mean} and \ref{thm:cross-fit-mean} (all outlined in Appendix \ref{app:theory}). Then, letting […] In particular, this implies that, for any $\delta \in (0, 1)$, the set […] is a $1 - \delta$ confiden…

Figures (14)

  • Figure 1: Comparison of our doubly-robust estimator with baselines on three datasets from our Persona Simulation Framework. Red and black dashed lines denote the true source and target mean ratings, respectively (e.g., the average "helpfulness" rating obtained over source vs. target distributions). Persona-Based directly averages persona ratings to compute a system quality estimate. Sample Average produces a system quality estimate by averaging human ratings sampled from the source distribution. PPI++ (angelopoulos2023ppipp) and RePPI (ji2025predictions) are two state-of-the-art statistical methods that do not account for evaluation sampling bias. Across settings, we observe that our Doubly-Robust (Riesz) estimator yields improved coverage and lower bias than baselines, while maintaining informative confidence intervals.
  • Figure 2: Our framework produces estimates for the target parameter $\theta_t$ using (i) complete rating tuples from the source distribution (blue, left), (ii) unlabeled samples from the target distribution (yellow, right), and (iii) persona ratings produced for both source and target samples (red, top). Evaluation sampling bias may arise both from the covariate shift of $(V, X)$ from $P_s$ to $P_t$, and from selection bias in which rating completion $C$ is non-random in $P_s$, i.e., $C \not\perp\!\!\!\perp (V, X)$.
  • Figure 3: Coverage by persona quality (top), covariate shift (center), and selection bias (bottom). DR (Riesz) attains better coverage than all baselines. Baselines with 0% coverage are omitted to reduce clutter. $\eta = 0.1$; $\rho = 0.6$ for bottom two rows. Figs. \ref{fig:full_synthetic}–\ref{fig:full_dices} (Appendix \ref{appendix:experiment_details}) present analogous results for Bias (MAE) and Interval Width.
  • Figure 4: Average Bias (MAE), Coverage, and Interval Width across experimental conditions presented in Fig. \ref{fig:main_1D_plots}. Values in parentheses denote standard error (values $<0.01$ omitted to reduce clutter).
  • Figure 5: Coverage of DR (Riesz) (solid) versus RePPI (dashed) when varying covariate shift (left) and selection bias (right) with persona ratings from different LLM judges. Parentheses denote Pearson correlation between persona and human ratings.
  • ...and 9 more figures
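
The three metrics named in the captions above (Bias as mean absolute error, Coverage, and Interval Width) can be sketched as simple averages over repeated trials. The definitions below are the conventional ones and are an assumption here, since the captions do not spell out the exact formulas. The function name `summarize_trials` and the toy inputs are hypothetical.

```python
def summarize_trials(estimates, intervals, truth):
    """Summarize repeated estimation trials against a known target value.

    estimates : point estimates, one per trial
    intervals : (lower, upper) confidence bounds, one pair per trial
    truth     : the true target parameter (known only in simulation)
    """
    n = len(estimates)
    # Bias (MAE): mean absolute deviation of the point estimates from the truth.
    bias_mae = sum(abs(e - truth) for e in estimates) / n
    # Coverage: fraction of trials whose interval contains the truth.
    coverage = sum(lo <= truth <= hi for lo, hi in intervals) / n
    # Interval width: mean distance between upper and lower bounds.
    width = sum(hi - lo for lo, hi in intervals) / n
    return {"bias_mae": bias_mae, "coverage": coverage, "interval_width": width}


# Toy usage: three trials, only the second interval contains the truth.
summary = summarize_trials(
    estimates=[1.0, 2.0, 3.0],
    intervals=[(0.5, 1.5), (1.5, 2.5), (2.5, 3.5)],
    truth=2.0,
)
print(summary)
```

Reading the three numbers together matters: an estimator can reach high coverage with uninformatively wide intervals, or narrow intervals centered on the wrong value.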

Theorems & Definitions (14)

  • Theorem 3.1
  • Lemma B.1
  • proof
  • Theorem B.2
  • Corollary B.3
  • Theorem B.4
  • Theorem C.1
  • Remark C.2
  • Corollary C.3
  • Theorem C.4
  • ...and 4 more