Multiple-Prediction-Powered Inference

Charlie Cowen-Breen, Alekh Agarwal, Stephen Bates, William W. Cohen, Jacob Eisenstein, Amir Globerson, Adam Fisch

Abstract

Statistical estimation often involves tradeoffs between expensive, high-quality measurements and a variety of lower-quality proxies. We introduce Multiple-Prediction-Powered Inference (MultiPPI): a general framework for constructing statistically efficient estimates by optimally allocating resources across these diverse data sources. This work provides theoretical guarantees about the minimax optimality, finite-sample performance, and asymptotic normality of the MultiPPI estimator. Through experiments across three diverse large language model (LLM) evaluation scenarios, we show that MultiPPI consistently achieves lower estimation error than existing baselines. This advantage stems from its budget-adaptive allocation strategy, which strategically combines subsets of models by learning their complex cost and correlation structures.
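To make the framework concrete, the sketch below shows the classical single-predictor prediction-powered inference (PPI) mean estimator that MultiPPI generalizes: a cheap model's predictions on a large unlabeled set are debiased using a small set of expensive gold labels. This is a minimal illustration of the general idea, not the paper's MultiPPI estimator; the function name, the toy data, and the fixed tuning parameter `lam` are all illustrative assumptions.

```python
import numpy as np

def ppi_mean(y_labeled, f_labeled, f_unlabeled, lam=1.0):
    """Prediction-powered estimate of E[Y].

    Combines model predictions f on a large unlabeled set with a small
    labeled set used to correct their bias. lam=1.0 is classical PPI;
    lam=0.0 ignores the predictions and reduces to the plain sample mean.
    """
    # Rectifier: average gap between gold labels and (scaled) predictions.
    rectifier = np.mean(y_labeled - lam * f_labeled)
    return lam * np.mean(f_unlabeled) + rectifier

# Toy example: predictions are systematically biased upward by 0.2.
rng = np.random.default_rng(0)
theta = 0.5                                          # true mean
y_lab = rng.normal(theta, 0.1, size=50)              # 50 expensive labels
f_lab = y_lab + 0.2 + rng.normal(0, 0.05, size=50)   # biased predictions
f_unlab = rng.normal(theta + 0.2, 0.1, size=5000)    # 5000 cheap predictions
est = ppi_mean(y_lab, f_lab, f_unlab)
```

Because the rectifier cancels the predictor's bias, the estimate stays centered on the true mean while inheriting the low variance of the large unlabeled sample.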

Paper Structure

This paper contains 66 sections, 23 theorems, 147 equations, 17 figures, and 1 table.

Key Result

Theorem 2

For all $\Sigma\succ0$, the MultiPPI estimator attains the minimax-optimal variance, where the variance is taken with respect to any distribution $P\in\mathcal{P}_\Sigma$.

Figures (17)

  • Figure 1: Results by budget for the experiments on Chatbot Arena (a), ProcessBench (b), and Factuality (c). For each estimator (all baselines and MultiPPI), the left column plots the empirical coverage of the 95% CI, the middle column plots the width of the 95% CI, and the right column plots the empirical mean-squared error of the point estimate. The fully-labeled sample size $N$ is 250.
  • Figure 2: Proportion of budget allocated to different models in Experiment 1: Chatbot Arena. Gemini 2.5 Flash, the cheapest model, is most sampled in the low-budget regime, while the proportion of budget allocated to the joint (both models combined) increases monotonically with budget.
  • Figure 3: Proportion of budget allocated to different models in Experiment 2: ProcessBench. Tiny (125 word thinking budget) is most sampled in the low-budget regime, while the proportion of budget allocated to the joint (all models combined) increases monotonically with budget.
  • Figure 4: Linear parameters $\lambda_I$ learned across budget regimes in Experiment 2: ProcessBench. While only the tiny model (125 word thinking budget) has a nonzero linear parameter in the low-budget regime, a cascading behavior is learned in the large-budget regime: the cheaper models are prescribed the opposite sign from the more-expensive models in the joint term.
  • Figure 5: Linear parameters $\lambda_I$ learned across budget regimes in Experiment 1: Chatbot Arena. While only Gemini 2.5 Pro has a nonzero linear parameter in the low-budget regime, a cascading behavior is learned in the large-budget regime: the cheaper model (Gemini 2.5 Flash) is prescribed the opposite sign from the more-expensive model (Gemini 2.5 Pro) in the joint term.
  • ...and 12 more figures
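The linear parameters $\lambda_I$ described in Figures 4 and 5 weight each model's predictions in the combined estimator. A minimal sketch of the underlying idea, under the assumption that the coefficients are chosen to minimize variance as in a multivariate control variate (the analogue of the tuned coefficient in PPI-style estimators, not the paper's budget-adaptive $\lambda_I$), is:

```python
import numpy as np

def optimal_lambda(y, F):
    """Variance-minimizing linear coefficients for k predictors.

    y: (n,) gold labels; F: (n, k) predictions from k models on the
    labeled set. The control-variate optimum is
        lambda* = Cov(F, F)^{-1} Cov(F, y),
    i.e., the coefficients of the least-squares projection of y onto
    the (centered) predictions.
    """
    Fc = F - F.mean(axis=0)            # center each model's predictions
    yc = y - y.mean()                  # center the labels
    cov_FF = Fc.T @ Fc / len(y)        # k x k predictor covariance
    cov_Fy = Fc.T @ yc / len(y)        # k-vector of cross-covariances
    return np.linalg.solve(cov_FF, cov_Fy)

# Toy example: model 1 tracks y closely, model 2 is pure noise.
rng = np.random.default_rng(1)
y = rng.normal(size=200)
F = np.column_stack([y + 0.1 * rng.normal(size=200),
                     rng.normal(size=200)])
lam = optimal_lambda(y, F)
```

In this toy setup the informative model receives a coefficient near 1 while the uninformative model's coefficient shrinks toward 0, mirroring how correlation structure drives the learned weights in the figures above.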

Theorems & Definitions (46)

  • Definition 1
  • Theorem 2: Minimax optimality of MultiPPI for known $\Sigma$
  • Theorem 3
  • Theorem 4: Stability of MultiPPI
  • Theorem 5
  • Corollary 1
  • Corollary 2
  • Corollary 3
  • Theorem 6: Finite-sample bounds specialized to Ledoit-Wolf shrinkage
  • Theorem 7
  • ...and 36 more