No Free Lunch: Non-Asymptotic Analysis of Prediction-Powered Inference
Pranav Mani, Peng Xu, Zachary C. Lipton, Michael Oberst
TL;DR
This work analyzes mean estimation with pseudo-labels under Prediction-Powered Inference (PPI) and its PPI++ variant, revealing a finite-sample no-free-lunch phenomenon: PPI++ improves over the classical estimator only when the correlation between pseudo-labels and true labels is sufficiently large relative to the labeled-sample size. It derives exact MSE expressions for cross-fit and split-sample PPI++, relates the improvement to this correlation, and shows how covariance-estimation error and bias affect finite-sample performance, with explicit thresholds in the Gaussian and binary cases. The results provide practical guidance on when to use PPI++, including sample-size requirements and the dangers of single-sample PPI++ in finite samples, and are corroborated by experiments on the AlphaFold and Galaxy datasets. The findings connect to analogous estimators in causal inference (AIPW/doubly robust) while emphasizing non-asymptotic behavior, enabling practitioners to quantify potential gains from pseudo-labels in small-sample regimes.
Abstract
Prediction-Powered Inference (PPI) is a popular strategy for combining gold-standard labels with possibly noisy pseudo-labels to perform statistical estimation. Prior work has established an asymptotic "free lunch" for PPI++, an adaptive form of PPI: the *asymptotic* variance of PPI++ is always less than or equal to the variance obtained from using gold-standard labels alone. Notably, this result holds *regardless of the quality of the pseudo-labels*. In this work, we demystify this result by conducting an exact finite-sample analysis of the estimation error of PPI++ on the mean estimation problem. We give a "no free lunch" result, characterizing the settings (and sample sizes) where PPI++ has provably worse estimation error than using gold-standard labels alone. Specifically, PPI++ outperforms the gold-standard-only estimator if and only if the correlation between pseudo-labels and gold-standard labels is above a certain level that depends on the number of labeled samples ($n$). In some cases our results simplify considerably: for Gaussian data, the correlation must be at least $1/\sqrt{n - 2}$ in order to see improvement, and a similar result holds for binary labels. In experiments, we illustrate that our theoretical findings hold on real-world datasets, and give insights into trade-offs between single-sample and sample-splitting variants of PPI++.
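
To make the Gaussian threshold concrete, the following is a minimal sketch, not the authors' implementation. The function names `gaussian_correlation_threshold` and `ppi_pp_mean` are hypothetical. The plug-in weight `lam` follows the commonly used PPI++ form for mean estimation, estimated from the same labeled sample, and is assumed here for illustration; the exact estimators analyzed in the paper (cross-fit and split-sample) may differ in detail.

```python
import math
import numpy as np


def gaussian_correlation_threshold(n: int) -> float:
    """Correlation level 1 / sqrt(n - 2) quoted in the abstract for Gaussian data."""
    if n <= 2:
        raise ValueError("the threshold is only defined for n > 2")
    return 1.0 / math.sqrt(n - 2)


def ppi_pp_mean(y_lab, f_lab, f_unlab, lam=None):
    """Power-tuned mean estimate: mean(y) + lam * (mean(f_unlab) - mean(f_lab)).

    y_lab   : gold-standard labels on the n labeled points
    f_lab   : pseudo-labels on the same n labeled points
    f_unlab : pseudo-labels on the N unlabeled points
    lam     : weight on the pseudo-label correction. lam=0 gives the
              classical sample mean and lam=1 gives the original PPI estimate.
              If None, a plug-in value is estimated from the labeled sample
              (an assumption of this sketch, i.e. a single-sample variant).
    """
    y_lab = np.asarray(y_lab, dtype=float)
    f_lab = np.asarray(f_lab, dtype=float)
    f_unlab = np.asarray(f_unlab, dtype=float)
    n, big_n = len(y_lab), len(f_unlab)
    if lam is None:
        cov_yf = np.cov(y_lab, f_lab, ddof=1)[0, 1]
        var_f = np.var(f_lab, ddof=1)
        lam = cov_yf / ((1.0 + n / big_n) * var_f)
    return y_lab.mean() + lam * (f_unlab.mean() - f_lab.mean())


if __name__ == "__main__":
    # The required correlation shrinks as the labeled sample grows.
    for n in (6, 18, 102):
        print(n, gaussian_correlation_threshold(n))
```

With only $n = 6$ labeled points the required correlation is $1/\sqrt{4} = 0.5$, while at $n = 102$ it drops to $1/\sqrt{100} = 0.1$, which illustrates why the quality of the pseudo-labels matters most in small-sample regimes.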
