Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation

Adam Fisch; Joshua Maynez; R. Alex Hofer; Bhuwan Dhingra; Amir Globerson; William W. Cohen

TL;DR

StratPPI extends Prediction-Powered Inference (PPI) with stratified sampling, exploiting heterogeneity in autorater performance across input subdomains to deliver provably valid confidence intervals (CIs) that are tighter than those of unstratified PPI. The method defines a stratified, rectified loss and derives asymptotic normality for the stratified estimator, along with closed-form solutions for the optimal per-stratum weighting and budget allocation. The authors provide theoretical guarantees and experiments on simulations and real datasets (Seahorse, AttributedQA, Galaxy) that show substantial reductions in CI width, particularly under heterogeneity, which reduces the number of human labels needed for reliable evaluation. The approach offers a practical, frequentist alternative to Bayesian methods for hybrid evaluation and can support more efficient, subdomain-aware assessments of LLMs.
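To make the unstratified baseline concrete, here is a minimal Python sketch of a PPI++-style confidence interval for a mean. It is an illustration under stated assumptions, not the authors' code: the function name `ppi_mean_ci` and its arguments are hypothetical, the weight is the standard plug-in power-tuning weight for mean estimation, and clipping it to [0, 1] is a choice made for this sketch.

```python
from statistics import NormalDist

import numpy as np


def ppi_mean_ci(y, f_lab, f_unlab, alpha=0.1, lam=None):
    """PPI++-style estimate and CI for E[Y] (hypothetical helper, not the paper's code).

    y       : human labels on the n labeled examples
    f_lab   : autorater scores on those same n examples
    f_unlab : autorater scores on N additional unlabeled examples
    lam     : weight on the autorater; None selects a plug-in estimate
    """
    y, f_lab, f_unlab = (np.asarray(a, dtype=float) for a in (y, f_lab, f_unlab))
    n, N = len(y), len(f_unlab)
    if lam is None:
        # Plug-in weight for mean estimation: Cov(Y, f) / ((1 + n/N) Var(f)).
        var_f = np.var(np.concatenate([f_lab, f_unlab]), ddof=1)
        cov = np.cov(y, f_lab, ddof=1)[0, 1]
        # Clipping to [0, 1] is an assumption of this sketch.
        lam = 0.0 if var_f == 0 else float(np.clip(cov / ((1 + n / N) * var_f), 0.0, 1.0))
    # Rectified estimate: labeled mean plus a weighted autorater correction.
    theta = y.mean() + lam * (f_unlab.mean() - f_lab.mean())
    # Variance of the estimate: labeled residual term plus unlabeled autorater term.
    var = np.var(y - lam * f_lab, ddof=1) / n + lam**2 * np.var(f_unlab, ddof=1) / N
    half = NormalDist().inv_cdf(1 - alpha / 2) * np.sqrt(var)
    return theta, (theta - half, theta + half)
```

Setting `lam=0.0` ignores the autorater and recovers the classical sample-mean interval, which makes it easy to compare widths on the same data.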

Abstract

Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data. PPI achieves this by combining small amounts of human-labeled data with larger amounts of data labeled by a reasonably accurate -- but potentially biased -- automatic system, in a way that results in tighter confidence intervals for certain parameters of interest (e.g., the mean performance of a language model). In this paper, we propose a method called Stratified Prediction-Powered Inference (StratPPI), in which we show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies. Without making any assumptions on the underlying automatic labeling system or data distribution, we derive an algorithm for computing provably valid confidence intervals for population parameters (such as averages) that is based on stratified sampling. In particular, we show both theoretically and empirically that, with appropriate choices of stratification and sample allocation, our approach can provide substantially tighter confidence intervals than unstratified approaches. Specifically, StratPPI is expected to improve in cases where the performance of the autorater varies across different conditional distributions of the target data.
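The stratified idea in the abstract can be sketched for mean estimation as follows. This is a hedged illustration, not the paper's algorithm: it assumes the stratum probabilities are known, that labeled and unlabeled samples are drawn independently within each stratum, and it uses a plug-in per-stratum weight. The name `strat_ppi_mean_ci` and the input layout are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def strat_ppi_mean_ci(strata, weights, alpha=0.05):
    """Stratified PPI-style estimate and CI for E[Y] (illustrative sketch only).

    strata  : list of (y_k, f_lab_k, f_unlab_k) triples, one per stratum
    weights : known stratum probabilities w_k, summing to 1 (an assumption)
    """
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0), "stratum weights must sum to 1"
    theta, var = 0.0, 0.0
    for w, (y, f_lab, f_unlab) in zip(weights, strata):
        y, f_lab, f_unlab = (np.asarray(a, dtype=float) for a in (y, f_lab, f_unlab))
        n_k, N_k = len(y), len(f_unlab)
        var_f = np.var(f_unlab, ddof=1)
        cov = np.cov(y, f_lab, ddof=1)[0, 1]
        # Per-stratum plug-in weight: rely on the autorater only where it tracks labels.
        lam = 0.0 if var_f == 0 else cov / ((1 + n_k / N_k) * var_f)
        # Rectified mean estimate within the stratum.
        theta_k = y.mean() + lam * (f_unlab.mean() - f_lab.mean())
        var_k = np.var(y - lam * f_lab, ddof=1) / n_k + lam**2 * var_f / N_k
        # Combine strata; independence across strata gives the w_k^2 factor.
        theta += w * theta_k
        var += w**2 * var_k
    half = NormalDist().inv_cdf(1 - alpha / 2) * np.sqrt(var)
    return theta, (theta - half, theta + half)
```

Because each stratum gets its own weight, a stratum where the autorater is uninformative contributes roughly its classical variance, while a stratum where the autorater is accurate contributes much less. That is the heterogeneity the abstract says StratPPI is expected to benefit from.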

Paper Structure (29 sections, 9 theorems, 52 equations, 4 figures, 1 algorithm)

Introduction
Related Work
Preliminaries
A rectified prediction-powered loss
A prediction-powered confidence interval
Stratified prediction-powered inference
A stratified prediction-powered confidence interval
Optimal weighting of the autorater predictions
Optimal allocation of the sampling budget
Experimental results
Simulation studies
Real data studies
Seahorse.
AttributedQA.
Galaxy.
...and 14 more sections

Key Result

Theorem 1

Assume that $\hat{\lambda} \overset{p}{\rightarrow} \lambda$ and $\frac{n}{N} \rightarrow r \geq 0$, where $\lambda \in \mathbb{R}$ is a hyper-parameter. Let $H_{\theta^*} := \mathbb{E}[\nabla^2 \ell_{\theta^*}]$. Then, under the regularity conditions of Definition A.1, we have that $\sqrt{n} (\hat{\theta}_{\hat{\lambda}}^{\mathrm{PP}} - \theta^*) \overset{d}{\rightarrow} \mathcal{N}(\ldots)$
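
As a worked special case (an illustration based on the standard PPI++ mean estimator, not text quoted from the theorem), consider mean estimation with the squared loss $\ell_\theta(x, y) = \frac{1}{2}(\theta - y)^2$, so that $H_{\theta^*} = 1$. The symbols $f$ (the autorater's prediction) and $\tilde{X}_j$ (the $N$ unlabeled inputs) are notation assumed here. For a fixed $\lambda$, the estimator and its limiting distribution are then

$$
\hat{\theta}_{\lambda}^{\mathrm{PP}}
  = \frac{1}{n}\sum_{i=1}^{n} Y_i
  + \lambda\left(\frac{1}{N}\sum_{j=1}^{N} f(\tilde{X}_j) - \frac{1}{n}\sum_{i=1}^{n} f(X_i)\right),
\qquad
\sqrt{n}\,\bigl(\hat{\theta}_{\lambda}^{\mathrm{PP}} - \theta^*\bigr)
  \overset{d}{\rightarrow}
  \mathcal{N}\!\left(0,\ \mathrm{Var}\bigl(Y - \lambda f(X)\bigr) + r\,\lambda^{2}\,\mathrm{Var}\bigl(f(X)\bigr)\right).
$$

The first variance term comes from the $n$ labeled examples and the second from the $N$ unlabeled ones, scaled by $r = \lim n/N$. With $\lambda = 0$ the variance reduces to the classical $\mathrm{Var}(Y)$.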

Figures (4)

Figure 1: Mean estimation simulation study with $K = 2$ and $\alpha = 0.1$. The top row plots coverage (i.e., the fraction of cases in which the CI contained the true parameter value $\theta^*$). The middle row plots the mean CI width ($\downarrow$ is better). Shaded areas plot the $16/84$ quantiles across $5$k trials. The bottom row plots the RMSE of $\hat{\theta}^\mathrm{SPP}$ computed across the $5$k trials, which follows the same trend as the mean CI width because the estimator is unbiased. The left column shows a setting where strata are homogeneous, and StratPPI provides no benefit over standard PPI++ (but is not worse). The middle and right columns show heterogeneous settings where the autorater has either a different bias ($\mu$) or a different variance ($\sigma$) per stratum, in which case StratPPI helps substantially. As strata variances are known, we only report proportional and optimal sample allocation results for StratPPI.
Figure 2: Mean estimation on real data with $K = 10$ and $\alpha = 0.05$. The $x$ axis plots the number of human-labeled examples $n$; the $y$ axis plots CI width, percent reduction in CI width against the classical estimate, and the effective sample size (the number of human labels necessary to match the same confidence interval via classical inference). Shaded areas plot the $16/84$ quantiles across $1$k trials. All StratPPI methods improve over classical inference and PPI++.
Figure 3: Win-rate experiment on Chatbot Arena for gpt-4-1106-preview vs. claude-2.1. Scores are based on the average label ('better' = 1 vs. 'worse' = 0) over 10 samples from Gemini Ultra acting as an LLM judge. Interestingly, as these confidence scores are not calibrated, our heuristic becomes overly aggressive at higher $n$. Future work can explore how to best incorporate additional regularization into the estimated optimal sampling ratios $\rho$.
Figure : Stratified prediction-powered inference for general M-estimators (StratPPI)

Theorems & Definitions (19)

Theorem 1: PPI++ (Angelopoulos et al., 2023)
Corollary 1: PPI++ CI (Angelopoulos et al., 2023)
Theorem 2
Corollary 2
Proposition 1
Proposition 2
Example 1: $\lambda_k^*$ for mean estimation
Proposition 3
Example 2: $\rho_k^*$ for mean estimation
Definition A.1: Regularity conditions of $\ell_\theta$
...and 9 more
