No Free Lunch: Non-Asymptotic Analysis of Prediction-Powered Inference

Pranav Mani, Peng Xu, Zachary C. Lipton, Michael Oberst

TL;DR

This work analyzes mean estimation with pseudo-labels under Prediction-Powered Inference (PPI) and its adaptive variant PPI++, and establishes a finite-sample "no free lunch" result: PPI++ improves on the classical estimator only when the correlation between pseudo-labels and true labels is sufficiently large relative to the number of labeled samples. The paper derives exact MSE expressions for cross-fit and split-sample PPI++, ties the improvement to this correlation, and shows how covariance-estimation error and bias affect finite-sample performance, giving explicit thresholds in the Gaussian and binary cases. The results offer practical guidance on when to use PPI++, including sample-size requirements and the risks of single-sample PPI++ in finite samples, and are corroborated by experiments on the Alphafold and Galaxies datasets. The analysis also draws an analogy to doubly robust (AIPW) estimators from causal inference while focusing on non-asymptotic behavior, so that practitioners can quantify the potential gains from pseudo-labels in small-sample regimes.

Abstract

Prediction-Powered Inference (PPI) is a popular strategy for combining gold-standard and possibly noisy pseudo-labels to perform statistical estimation. Prior work has shown an asymptotic "free lunch" for PPI++, an adaptive form of PPI, showing that the *asymptotic* variance of PPI++ is always less than or equal to the variance obtained from using gold-standard labels alone. Notably, this result holds *regardless of the quality of the pseudo-labels*. In this work, we demystify this result by conducting an exact finite-sample analysis of the estimation error of PPI++ on the mean estimation problem. We give a "no free lunch" result, characterizing the settings (and sample sizes) where PPI++ has provably worse estimation error than using gold-standard labels alone. Specifically, PPI++ will outperform if and only if the correlation between pseudo- and gold-standard is above a certain level that depends on the number of labeled samples ($n$). In some cases our results simplify considerably: For Gaussian data, the correlation must be at least $1/\sqrt{n - 2}$ in order to see improvement, and a similar result holds for binary labels. In experiments, we illustrate that our theoretical findings hold on real-world datasets, and give insights into trade-offs between single-sample and sample-splitting variants of PPI++.
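To make the Gaussian threshold concrete, the short sketch below inverts the condition "correlation at least $1/\sqrt{n - 2}$" to get the smallest $n$ at which a given correlation could help, namely $n \geq 2 + 1/\rho^2$. It takes the abstract's threshold at face value. The function name `min_labeled_samples` and the symbol `rho` are illustrative and not taken from the paper, and how $n$ is counted for each PPI++ variant should be checked against the formal statements.

```python
import math


def min_labeled_samples(rho: float) -> int:
    """Smallest integer n satisfying |rho| >= 1 / sqrt(n - 2).

    Assumption: this simply inverts the Gaussian threshold quoted in the
    abstract, which rearranges to n >= 2 + 1 / rho**2.
    """
    if not 0.0 < abs(rho) <= 1.0:
        raise ValueError("rho must be a nonzero correlation in [-1, 1]")
    return math.ceil(2.0 + 1.0 / rho ** 2)


if __name__ == "__main__":
    for rho in (1.0, 0.5, 0.25):
        print(f"rho = {rho}: need n >= {min_labeled_samples(rho)}")
```

Under this reading, weaker pseudo-labels raise the required sample size quickly: halving the correlation roughly quadruples the $1/\rho^2$ term.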

Paper Structure

This paper contains 35 sections, 31 theorems, 127 equations, 4 figures.

Key Result

Proposition 4.1

Let $Y,F$ be jointly Gaussian random variables with correlation $\rho$, and consider the Single-Sample PPI++ (Definition 3.4) and Cross-fit PPI++ (Definition 3.6) estimators, which both make use of $2n$ labeled samples overall. Then the $\mathsf{MSE}$ of each estimator is lower than that of the classical estimator (Definition 3.1) if $\rho^2 > \frac{1}{cn - 2}$, where $c = 2$ for Single-Sample PPI++ and $c = 1$ for Cross-fit PPI++. Note that if the reverse inequality holds, the $\mathsf{MSE}$ of each estimator is higher than that of the classical estimator.
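A quick Monte Carlo check can illustrate the proposition's qualitative claim. The sketch below is not the authors' code. It assumes the usual control-variate form of PPI++ with infinite unlabeled data, $\hat{\theta} = \bar{Y} - \hat{\lambda}(\bar{F} - \mathbb{E}[F])$ with $\hat{\lambda}$ the sample covariance of $(Y, F)$ divided by the sample variance of $F$, since the formal definitions are not reproduced in this summary. The names `slope_hat` and `simulate_mse` are hypothetical. Data are standard bivariate Gaussian with $\mathbb{E}[Y] = \mathbb{E}[F] = 0$, so the squared estimate is the squared error.

```python
import numpy as np


def slope_hat(y, f):
    """Per-row plug-in coefficient: sample Cov(Y, F) / sample Var(F)."""
    fc = f - f.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (fc * yc).sum(axis=1) / (fc ** 2).sum(axis=1)


def simulate_mse(rho, n, trials=100_000, seed=0):
    """Monte Carlo MSE of three mean estimators using 2n labeled samples.

    Assumptions (not taken from the paper's text): E[F] = 0 is known
    exactly (the "infinite N" setting) and the true mean of Y is 0.
    """
    rng = np.random.default_rng(seed)
    f = rng.standard_normal((trials, 2 * n))
    noise = rng.standard_normal((trials, 2 * n))
    y = rho * f + np.sqrt(1.0 - rho ** 2) * noise  # Corr(Y, F) = rho

    # Classical estimator: plain sample mean of all 2n labels.
    classical = y.mean(axis=1)

    # Single-sample: coefficient and means come from the same 2n points.
    single = classical - slope_hat(y, f) * f.mean(axis=1)

    # Cross-fit: coefficient from one fold, means from the other, averaged.
    y1, y2, f1, f2 = y[:, :n], y[:, n:], f[:, :n], f[:, n:]
    fold_a = y2.mean(axis=1) - slope_hat(y1, f1) * f2.mean(axis=1)
    fold_b = y1.mean(axis=1) - slope_hat(y2, f2) * f1.mean(axis=1)
    cross = 0.5 * (fold_a + fold_b)

    estimates = {"classical": classical, "single": single, "cross": cross}
    return {name: float(np.mean(est ** 2)) for name, est in estimates.items()}


if __name__ == "__main__":
    for rho in (0.1, 0.9):
        print(rho, simulate_mse(rho, n=8))
```

With $n = 8$, a correlation of 0.1 sits well below the abstract's $1/\sqrt{n - 2}$ level and 0.9 well above it, so the two runs should land on opposite sides of the classical estimator's MSE.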

Figures (4)

  • Figure 1: Relative $\mathsf{MSE}$ vs. Sample Size on the Alphafold Dataset for PPI++ estimators with black-box models $f$ of varying quality. Y-axis gives the difference $\mathsf{MSE}(\hat{\theta}_{\text{PPI++}}) - \mathsf{MSE}(\hat{\theta}_{\text{Classical}})$, such that lower (negative) values imply improvement over the classical estimator. Each line represents PPI++ with a different black-box model $f$. The blue line uses the original model, which has strong predictive performance, but only improves estimation error at $2n \geq 20$ (left) and $n \geq 10$ (right).
  • Figure 2: Comparison of Coverage and Interval Width for Cross-fit PPI++ (Definition 3.6) and Single-Sample PPI++ (Definition 3.4), using different models $f$ on the Alphafold dataset. (Row 1) Original pseudo-label model $f$, where Single-Sample PPI++ has substantially lower coverage than either Cross-fit PPI++ or the classical estimator. (Row 2) A modified model with lower correlation, where similar trends hold.
  • Figure 3: Relative $\mathsf{MSE}$ vs. Sample Size on the Galaxies Dataset for PPI++ estimators with black-box models $f$ of varying quality. Y-axis gives the difference $\mathsf{MSE}(\hat{\theta}_{\text{PPI++}}) - \mathsf{MSE}(\hat{\theta}_{\text{Classical}})$, such that lower (negative) values imply improvement over the classical estimator. Each line represents PPI++ with a different black-box model $f$. The blue line uses the original model, which has strong predictive performance, but only improves estimation error at $2n \geq 12$ (left) and $n \geq 6$ (right).
  • Figure 4: Comparison of Coverage and Interval Width for Cross-fit PPI++ (Definition 3.6) and Single-Sample PPI++ (Definition 3.4), using different models $f$ on the Galaxies dataset. (Row 1) Original pseudo-label model $f$, where Single-Sample PPI++ has lower coverage than either Cross-fit PPI++ or the classical estimator. (Row 2) A modified model with lower correlation, where similar trends hold.

Theorems & Definitions (54)

  • Definition 3.1: Classical Estimator
  • Definition 3.2: Prediction Powered Inference (PPI) (Angelopoulos et al., 2023)
  • Definition 3.3: Power Tuned PPI (PPI++) (Angelopoulos et al., 2023)
  • Definition 3.4: Single-Sample PPI++, Infinite $N$
  • Definition 3.5: Split-Sample PPI++ Estimator, infinite $N$
  • Definition 3.6: Cross-fit PPI++ Estimator, infinite $N$
  • Proposition 4.1: Condition for $\mathsf{MSE}$ Improvement, Gaussian Case
  • Proposition 4.2: $\mathsf{MSE}$, Independent Gaussian Case
  • Theorem 4.1: MSE of Crossfit-PPI++
  • Corollary 4.1: Sufficient Condition for Worse Performance
  • ...and 44 more