Prediction-Powered Inference with Imputed Covariates and Nonuniform Sampling

Dan M. Kluger, Kerri Lu, Tijana Zrnic, Sherrie Wang, Stephen Bates

TL;DR

This work tackles the challenge of performing valid inference when downstream analyses use machine-learning predictions for covariates. It introduces Predict-Then-Debias (PTD), a simple yet powerful estimator that combines a biased, high-volume estimator with a bias-correction term from a small complete sample, and generalizes it with an optimal tuning matrix $\hat{\Omega}$ to boost efficiency; the key asymptotic form is $\sqrt{N}(\hat{\theta}^{\textnormal{PTD},\hat{\Omega}} - \theta) \rightsquigarrow \mathcal{N}(0, \Sigma_{\text{PTD}}(\Omega))$, where $\Sigma_{\text{PTD}}(\Omega)$ is a function of the covariances between the estimators. The paper develops bootstrap-based confidence intervals that remain valid under nonuniform two-phase sampling and arbitrary imputation, including extensions to cluster and stratified designs, and introduces a faster convolution-based bootstrap to speed up computation. It also derives optimal tuning matrices (full and diagonal) that minimize asymptotic variance, showing that the tuned PTD is more efficient than the classical estimator and often competitive with or superior to PPI++ depending on the data-generating process. Empirical results on AlphaFold, housing price, tree cover, and census data demonstrate valid coverage and narrower intervals than complete-case analyses, with practical guidance on tuning matrix choice and CI construction. Overall, the framework provides robust, broadly applicable inference for prediction-powered analyses in diverse scientific domains.
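The combination described above can be illustrated with a minimal sketch for a single regression slope. The form used here, $\hat{\Omega}\,\hat{\gamma}^{\text{all}} + (\hat{\theta}^{\text{complete}} - \hat{\Omega}\,\hat{\gamma}^{\text{complete}})$, is an assumed reading of "biased high-volume estimator plus bias correction" and may differ in detail from the paper's definition. The function names, the simulated data, and the scalar `omega` (standing in for the tuning matrix when $d = 1$) are all hypothetical.

```python
import random

def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs (intercept included)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

def ptd_estimate(theta_complete, gamma_complete, gamma_all, omega=1.0):
    """Assumed PTD form: tuned biased estimate plus a complete-sample bias correction."""
    return omega * gamma_all + (theta_complete - omega * gamma_complete)

# Hypothetical data: x is the true covariate, x_pred its noisy ML imputation.
rng = random.Random(0)
N, n = 20000, 1000
x = [rng.gauss(0, 1) for _ in range(N)]
x_pred = [xi + rng.gauss(0, 0.5) for xi in x]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]
complete = rng.sample(range(N), n)  # uniform complete sample, for simplicity

# Classical estimator: true covariate, complete sample only.
theta_complete = slope([x[i] for i in complete], [y[i] for i in complete])
# Same regression with the imputed covariate, on the complete sample and on all rows.
gamma_complete = slope([x_pred[i] for i in complete], [y[i] for i in complete])
gamma_all = slope(x_pred, y)

theta_ptd = ptd_estimate(theta_complete, gamma_complete, gamma_all)
print("naive (imputed covariate only):", gamma_all)
print("classical (complete sample):   ", theta_complete)
print("PTD with omega = 1:            ", theta_ptd)
```

Setting `omega=0.0` recovers the classical complete-sample estimate, which is why a well-chosen tuning value should not do worse than ignoring the predictions.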

Abstract

Machine learning models are increasingly used to produce predictions that serve as input data in subsequent statistical analyses. For example, computer vision predictions of economic and environmental indicators based on satellite imagery are used in downstream regressions; similarly, language models are widely used to approximate human ratings and opinions in social science research. However, failure to properly account for errors in the machine learning predictions renders standard statistical procedures invalid. Prior work uses what we call the Predict-Then-Debias estimator to give valid confidence intervals when machine learning algorithms impute missing variables, assuming a small complete sample from the population of interest. We expand the scope by introducing bootstrap confidence intervals that apply when the complete data is a nonuniform (i.e., weighted, stratified, or clustered) sample and to settings where an arbitrary subset of features is imputed. Importantly, the method can be applied to many settings without requiring additional calculations. We prove that these confidence intervals are valid under no assumptions on the quality of the machine learning model and are no wider than the intervals obtained by methods that do not use machine learning predictions.
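As a rough illustration of a bootstrap interval under nonuniform labelling, the sketch below resamples whole rows with replacement, recomputes an inverse-probability-weighted PTD slope on each resample, and reports percentile endpoints. This is one plausible reading of the idea, not the paper's algorithm. The labelling probabilities, the weighting scheme, the function names, and the simulated data are assumptions made for the example.

```python
import random

def weighted_slope(xs, ys, ws):
    """Weighted least-squares slope of ys on xs (intercept included)."""
    sw = sum(ws)
    mx = sum(w * a for w, a in zip(ws, xs)) / sw
    my = sum(w * b for w, b in zip(ws, ys)) / sw
    sxy = sum(w * (a - mx) * (b - my) for w, a, b in zip(ws, xs, ys))
    sxx = sum(w * (a - mx) ** 2 for w, a in zip(ws, xs))
    return sxy / sxx

def ptd_slope(rows, omega=1.0):
    """rows hold (x_or_None, x_pred, y, weight); x is None when the row is incomplete."""
    lab = [r for r in rows if r[0] is not None]
    w = [r[3] for r in lab]
    y_lab = [r[2] for r in lab]
    theta_c = weighted_slope([r[0] for r in lab], y_lab, w)   # true covariate
    gamma_c = weighted_slope([r[1] for r in lab], y_lab, w)   # imputed covariate
    gamma_all = weighted_slope([r[1] for r in rows], [r[2] for r in rows],
                               [1.0] * len(rows))
    return omega * gamma_all + (theta_c - omega * gamma_c)

def percentile_bootstrap_ci(rows, alpha=0.10, B=200, seed=1):
    """Percentile interval from B row-level resamples of the full dataset."""
    rng = random.Random(seed)
    n = len(rows)
    draws = sorted(ptd_slope(rng.choices(rows, k=n)) for _ in range(B))
    return draws[round(alpha / 2 * B)], draws[round((1 - alpha / 2) * B) - 1]

# Hypothetical two-phase sample: labelling probability depends on the prediction.
rng = random.Random(0)
rows = []
for _ in range(2000):
    x = rng.gauss(0, 1)
    x_pred = x + rng.gauss(0, 0.5)
    y = 1.0 + 2.0 * x + rng.gauss(0, 1)
    pi = 0.6 if x_pred > 0 else 0.2          # assumed known sampling probability
    labelled = rng.random() < pi
    rows.append((x if labelled else None, x_pred, y, 1.0 / pi))

point = ptd_slope(rows)
lo, hi = percentile_bootstrap_ci(rows)
print("PTD slope:", point)
print("90% percentile bootstrap interval:", (lo, hi))
```

Resampling rows together with their labelling status keeps the complete and incomplete parts of each resample in realistic proportion. Clustered or stratified designs would need the resampling step changed accordingly.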

Paper Structure

This paper contains 62 sections, 15 theorems, 129 equations, 5 figures, 1 table.

Key Result

Proposition 2.1

Under the paper's sampling-and-labelling and asymptotic-linearity assumptions, if $\hat{\Omega} \xrightarrow{p} \Omega$ for some $\Omega \in \mathbb{R}^{d \times d}$, then as $N \to \infty$, $\sqrt{N} ( \hat{\theta}^{\textnormal{PTD},\hat{\Omega}} -\theta ) \xrightarrow{d} \mathcal{N} (0, \Sigma_{\textnormal{PTD}}(\Omega))$.

Figures (5)

  • Figure 1: Histograms of the nightlight coefficient estimator across $10{,}000$ simulations for three different estimation strategies. Each simulation used a random sample of size $40{,}000$ from the dataset from the MOSAIKS paper, with $n=1{,}500$ samples randomly assigned to the complete sample. The dashed vertical line gives the "true coefficient" for nightlights based on fitting a regression using the gold standard data from all available samples. See the housing price example section for more details on the dataset and the regression setup.
  • Figure 2: Half-violin plots of point estimates and confidence interval widths from the 7 experiments, each with 500 simulations. The panel names give the coefficient name, and the number in parentheses in each title gives the corresponding experiment number, according to the enumeration of experiments in the experiment summary table. For the green half-violin plots, the full percentile bootstrap algorithm was used to construct confidence intervals (except for Experiments 5 and 7, where the cluster bootstrap and stratified bootstrap algorithms were used, respectively).
  • Figure 3: Confidence interval widths and empirical coverage for different confidence interval construction approaches. Coefficients are normalized by the mean confidence interval width of the classical estimator. In the left column, each point gives the average width of the 90% confidence interval across 500 simulations for a given regression coefficient and method. The error bars give $\pm 1$ standard deviation of the confidence interval widths. The number in parentheses on the y-axis denotes which experiment is being plotted, according to the enumeration of experiments in the experiment summary table. The right panel gives the empirical coverage across the 500 simulations for each method, experiment, and coefficient, and the dashed vertical line is the desired coverage of 0.9. (CLT-based and convolution bootstrap-based speedups to the PTD method were not implemented in all instances, given that their implementation requires additional mathematical calculations.)
  • Figure 4: Confidence interval widths and empirical coverage for different tuning matrix choices. Other aspects of the plot are as in Figure 3.
  • Figure A1: Letting $X \sim \mathcal{N}(0,0.25^2)$, $\tilde{X}=X+E$ where $E \perp\!\!\!\perp X$ with $E \sim \mathcal{N}(0,1)$ and $q=0.99$, the following R code generates $10^8$ samples of $(X,\tilde{X})$ and uses these $10^8$ samples to construct 99.99% confidence intervals for $\mathrm{Corr}(I\{X \leq F^{-1}(q) \}, I\{\tilde{X} \leq F^{-1}(q) \})$ and $\mathrm{Corr}(I\{X \leq F^{-1}(q) \}, I\{ \tilde{X} \leq \tilde{F}^{-1}(q) \})$. These confidence intervals are far apart, giving strong numerical evidence that $\mathrm{Corr}(I\{X \leq F^{-1}(q) \}, I\{\tilde{X} \leq F^{-1}(q) \})>\mathrm{Corr}(I\{X \leq F^{-1}(q) \}, I\{ \tilde{X} \leq \tilde{F}^{-1}(q) \})$, which by the paper's equation comparing the PPI++ and tuned PTD variances further implies that $\sigma_{\text{PPI++}}^2 < \sigma_{\text{TPTD}}^2$.
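
The comparison in Figure A1 can also be checked without simulation. The sketch below is not the R code the caption refers to: it swaps the Monte Carlo experiment for numerical integration, computing both indicator correlations from the joint probability $P(X \leq a, X + E \leq b) = \int_{-\infty}^{a} f_X(x)\, P(E \leq b - x)\, dx$. The function names, the integration grid, and the lower truncation point are assumptions of this example.

```python
from math import sqrt
from statistics import NormalDist

SD_X, SD_E, Q = 0.25, 1.0, 0.99
X = NormalDist(0.0, SD_X)
E = NormalDist(0.0, SD_E)
X_TILDE = NormalDist(0.0, sqrt(SD_X ** 2 + SD_E ** 2))  # X + E with E independent of X

def joint_prob(a, b, steps=20000):
    """P(X <= a, X + E <= b) via midpoint-rule integration over x <= a."""
    lower = -10 * SD_X            # probability mass below this point is negligible
    h = (a - lower) / steps
    total = 0.0
    for k in range(steps):
        x = lower + (k + 0.5) * h
        total += X.pdf(x) * E.cdf(b - x)
    return total * h

def indicator_corr(a, b):
    """Correlation between I{X <= a} and I{X_tilde <= b}."""
    pa, pb = X.cdf(a), X_TILDE.cdf(b)
    pab = joint_prob(a, b)
    return (pab - pa * pb) / sqrt(pa * (1 - pa) * pb * (1 - pb))

x_cut = X.inv_cdf(Q)              # F^{-1}(q)
xt_cut = X_TILDE.inv_cdf(Q)       # F-tilde^{-1}(q)

corr_same_threshold = indicator_corr(x_cut, x_cut)    # both indicators use F^{-1}(q)
corr_own_quantile = indicator_corr(x_cut, xt_cut)     # X_tilde uses its own quantile

print("Corr with shared threshold F^{-1}(q):", corr_same_threshold)
print("Corr with X_tilde's own quantile:    ", corr_own_quantile)
```

Under these settings the first correlation should exceed the second, which matches the direction of the inequality reported in the caption.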

Theorems & Definitions (30)

  • Proposition 2.1
  • Proposition 2.2
  • Remark 1: Bootstrap Consistency for Z-estimators
  • Remark 2: Hadamard differentiable estimators
  • Theorem 3.1
  • Theorem 3.2
  • Corollary A.1
  • proof
  • Remark 3
  • Remark 4
  • ...and 20 more