Power Analysis for Prediction-Powered Inference

Yiqun T. Chen, Moran Guo, Shengy Li

Abstract

Modern studies increasingly leverage outcomes predicted by machine learning and artificial intelligence (AI/ML) models, and recent work, such as prediction-powered inference (PPI), has developed valid downstream statistical inference procedures. However, classical power and sample size formulas do not readily account for these predictions. In this work, we tackle a simple yet practical question: given a new AI/ML model with high predictive power, how many labeled samples are needed to achieve a desired level of statistical power? We derive closed-form power formulas by characterizing the asymptotic variance of the PPI estimator and applying Wald test inversion to obtain the required labeled sample size. Our results cover widely used settings, including two-sample comparisons and risk measures in $2\times 2$ tables. A useful rule of thumb is that the reduction in required labeled samples relative to classical designs scales roughly with the $R^2$ between the predictions and the ground truth. Our analytical formulas are validated using Monte Carlo simulations, and we illustrate the framework in three contemporary biomedical applications spanning single-cell transcriptomics, clinical blood pressure measurement, and dermoscopy imaging. We provide our software as an R package and online calculators at https://github.com/yiqunchen/pppower.

Paper Structure

This paper contains 67 sections, 8 theorems, 86 equations, 20 figures, 2 tables.

Key Result

Proposition 1

Under our setup, the PPI estimator is unbiased and asymptotically normal with variance $\sigma_f^2/N + \sigma_\varepsilon^2/n$. The two terms reflect prediction noise $\sigma_f^2/N$ from the unlabeled sample and residual noise $\sigma_\varepsilon^2/n$ from the labeled sample. When $N$ is large, the variance is dominated by $\sigma_\varepsilon^2/n$. While the PPI estimator uses $f_i$, the resulting estimator remains unbiased ...
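
As a rough illustration of how a variance of this two-term form can be turned into a labeled sample size, the sketch below inverts a two-sided Wald test in Python. It is a minimal sketch under stated assumptions, not the paper's implementation and not the `pppower` API: the function names, the effect size `delta`, the level `alpha`, the target `power`, and the numeric inputs are all hypothetical, and the sample-size step uses the usual normal approximation that ignores the far rejection tail.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def ppi_variance(sigma_f2, sigma_eps2, n, N):
    """Two-term variance: prediction noise (unlabeled) + residual noise (labeled)."""
    return sigma_f2 / N + sigma_eps2 / n


def wald_power(delta, variance, alpha=0.05):
    """Power of a two-sided Wald test when the true effect is delta."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = delta / sqrt(variance)
    return _STD_NORMAL.cdf(shift - z_alpha) + _STD_NORMAL.cdf(-shift - z_alpha)


def required_labeled_n(delta, sigma_f2, sigma_eps2, N, alpha=0.05, power=0.8):
    """Smallest n whose variance meets the target implied by (alpha, power)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    target_variance = (delta / (z_alpha + z_beta)) ** 2
    # Variance budget left for the labeled sample after the unlabeled term.
    budget = target_variance - sigma_f2 / N
    if budget <= 0:
        raise ValueError("N is too small: prediction noise alone exceeds the target variance")
    return ceil(sigma_eps2 / budget)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only for illustration.
    delta, sigma_f2, sigma_eps2, N = 4.0, 200.0, 100.0, 10_000
    n = required_labeled_n(delta, sigma_f2, sigma_eps2, N)
    achieved = wald_power(delta, ppi_variance(sigma_f2, sigma_eps2, n, N))
    print(f"labeled n = {n}, achieved power = {achieved:.3f}")
```

If the predictions and residuals are additionally assumed to be uncorrelated, the outcome variance is $\sigma_f^2 + \sigma_\varepsilon^2$, and for large $N$ the ratio of the PPI labeled sample size to the classical one is approximately $\sigma_\varepsilon^2 / (\sigma_f^2 + \sigma_\varepsilon^2) = 1 - R^2$, which is one way to read the rule of thumb stated in the abstract.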

Figures (20)

  • Figure 1: Prediction-powered planning. Panel (a) shows the notation, estimator, and variance inversion. Panel (b) shows the NHANES sample-size plan for $\Delta = 4$ mmHg: gray is the classical design, and the orange curves are the age-only linear model, the richer clinical linear model, and the random forest surrogate.
  • Figure 2: One-sample mean validation: empirical power (points, with 95% Monte Carlo error bars for the binary setting) versus theoretical power (lines) for PPI++ with oracle $\lambda^\star$.
  • Figure 3: Odds-ratio and relative-risk validation in $2\times 2$ tables---analytical (lines) versus empirical (points) PPI++ power for odds ratio (top) and relative risk (bottom). Dashed lines show classical power.
  • Figure 4: Regression-contrast validation: analytical (lines) versus empirical (points) PPI++ power. In panel (b), the empirical logistic-regression points use a two-fold cross-fitted plug-in $\hat{\lambda}$ estimated within each replicate, while the analytical curve uses the large-reference-sample approximation. Dashed lines show the corresponding classical power curves.
  • Figure 5: Real-data planning and held-out validation across three biomedical case studies. Each row corresponds to one application (Baron scRNA-seq, NHANES systolic blood pressure, and ISIC melanoma), and the three columns give a study overview, the planned labeled sample size, and the held-out achieved power.
  • ...and 15 more figures

Theorems & Definitions (8)

  • Proposition 1: Mean PPI++ variance
  • Proposition 2: PPI++ power
  • Proposition 3: PPI++ sample size
  • Corollary 4: Rule of thumb
  • Proposition 5: Two-sample PPI++ variance
  • Proposition 6: Paired PPI++ variance
  • Proposition 7: PPI++ sample size for relative risk and odds ratio
  • Proposition 8: PPI++ sample size for regression contrasts