Prediction-Powered Inference with Imputed Covariates and Nonuniform Sampling
Dan M. Kluger, Kerri Lu, Tijana Zrnic, Sherrie Wang, Stephen Bates
TL;DR
This work addresses valid statistical inference when downstream analyses use machine-learning predictions in place of missing variables, including covariates. It studies the Predict-Then-Debias (PTD) estimator, which combines a biased estimator computed on a large sample of imputed data with a bias-correction term computed on a small complete sample, and generalizes it with a tuning matrix $\hat{\Omega}$ to improve efficiency. The key asymptotic result is $\sqrt{N}(\hat{\theta}^{\text{PTD},\hat{\Omega}} - \theta) \rightsquigarrow \mathcal{N}(0, \Sigma_{\text{PTD}}(\Omega))$, where $\Sigma_{\text{PTD}}(\Omega)$ depends on the variances and covariances of the component estimators. The paper develops bootstrap-based confidence intervals that remain valid under nonuniform two-phase sampling and arbitrary imputation, with extensions to cluster and stratified designs, and introduces a convolution-based bootstrap that reduces computation. It also derives optimal tuning matrices (full and diagonal) that minimize asymptotic variance, showing that the tuned PTD estimator is at least as efficient as the classical estimator and, depending on the data-generating process, often competitive with or superior to PPI++. Empirical results on AlphaFold, housing price, tree cover, and census data show valid coverage and narrower intervals than complete-case analyses, with practical guidance on choosing the tuning matrix and constructing confidence intervals. Overall, the framework supports inference for prediction-powered analyses across a range of scientific domains.
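As a rough illustration of the structure described above, the estimator can be written out explicitly. The component notation here is an assumption introduced for this sketch, not taken from the text: $\hat{\theta}^{\mathrm{c}}$ denotes the estimator computed on the complete sample with true values, $\hat{\gamma}^{\mathrm{c}}$ the same estimator computed on the complete sample with imputed values, and $\hat{\gamma}^{\mathrm{all}}$ the estimator computed on the full sample with imputed values. A tuned estimator of this type then takes the form

$$\hat{\theta}^{\text{PTD},\hat{\Omega}} = \hat{\Omega}\,\hat{\gamma}^{\mathrm{all}} + \left(\hat{\theta}^{\mathrm{c}} - \hat{\Omega}\,\hat{\gamma}^{\mathrm{c}}\right) = \hat{\theta}^{\mathrm{c}} - \hat{\Omega}\left(\hat{\gamma}^{\mathrm{c}} - \hat{\gamma}^{\mathrm{all}}\right).$$

Setting $\hat{\Omega} = 0$ recovers the complete-sample estimator, and $\hat{\Omega} = I$ gives the untuned version. For a fixed matrix $\Omega$, the usual control-variate calculation gives

$$\mathrm{Var}\left(\hat{\theta}^{\mathrm{c}} - \Omega D\right) = \mathrm{Var}(\hat{\theta}^{\mathrm{c}}) - \Omega C^{\top} - C\,\Omega^{\top} + \Omega V \Omega^{\top}, \qquad D = \hat{\gamma}^{\mathrm{c}} - \hat{\gamma}^{\mathrm{all}},\quad C = \mathrm{Cov}(\hat{\theta}^{\mathrm{c}}, D),\quad V = \mathrm{Var}(D),$$

which is minimized at $\Omega = C V^{-1}$ when $V$ is invertible. This is the generic variance-minimizing choice for an estimator of this shape; the exact expressions for the full and diagonal tuning matrices in the paper may be stated differently.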
Abstract
Machine learning models are increasingly used to produce predictions that serve as input data in subsequent statistical analyses. For example, computer vision predictions of economic and environmental indicators based on satellite imagery are used in downstream regressions; similarly, language models are widely used to approximate human ratings and opinions in social science research. However, failure to properly account for errors in the machine learning predictions renders standard statistical procedures invalid. Prior work uses what we call the Predict-Then-Debias estimator to give valid confidence intervals when machine learning algorithms impute missing variables, assuming a small complete sample from the population of interest. We expand the scope by introducing bootstrap confidence intervals that apply when the complete data is a nonuniform (i.e., weighted, stratified, or clustered) sample and to settings where an arbitrary subset of features is imputed. Importantly, the method can be applied to many settings without requiring additional calculations. We prove that these confidence intervals are valid under no assumptions on the quality of the machine learning model and are no wider than the intervals obtained by methods that do not use machine learning predictions.
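The bootstrap idea in the abstract can be illustrated with a minimal sketch for the simplest target, a population mean. This is not the authors' implementation: the function names `ptd_mean` and `ptd_bootstrap_ci`, the scalar tuning parameter `omega`, the use of inverse-probability weights on the complete sample, and the plain percentile interval are all assumptions made for illustration.

```python
import numpy as np

def ptd_mean(y, yhat, complete, weights, omega=1.0):
    """PTD-style estimate of E[Y] (illustrative sketch, hypothetical helper).

    y        : true values; may be NaN where `complete` is False
    yhat     : machine-learning predictions, available for every row
    complete : boolean mask marking rows where y was observed
    weights  : inverse sampling probabilities (only used on complete rows)
    omega    : scalar tuning parameter (0 = complete-case estimate only)
    """
    wc = weights[complete]
    theta_c = np.sum(wc * y[complete]) / np.sum(wc)     # weighted mean of true values
    gamma_c = np.sum(wc * yhat[complete]) / np.sum(wc)  # same estimator on predictions
    gamma_all = np.mean(yhat)                           # predictions on the full sample
    return omega * gamma_all + (theta_c - omega * gamma_c)

def ptd_bootstrap_ci(y, yhat, complete, weights, omega=1.0,
                     n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval obtained by resampling rows with replacement.

    Assumes every resample contains at least one complete row, which holds
    with overwhelming probability unless the complete sample is tiny.
    """
    rng = np.random.default_rng(seed)
    n = len(yhat)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample row indices
        draws[b] = ptd_mean(y[idx], yhat[idx], complete[idx], weights[idx], omega)
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

In this sketch, `omega=0` reduces to the weighted complete-case mean and `omega=1` gives the untuned estimate; a cluster or stratified design would require resampling whole clusters or resampling within strata rather than individual rows.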
