Do We Really Even Need Data? A Modern Look at Drawing Inference with Predicted Data
Stephen Salerno, Kentaro Hoffman, Awan Afiaz, Anna Neufeld, Tyler H. McCormick, Jeffrey T. Leek
TL;DR
This paper analyzes the statistical hazards of inference with predicted data (IPD) and shows that high predictive accuracy does not guarantee valid downstream conclusions, because prediction bias propagates into estimates and uncertainty is underestimated. It formalizes the IPD problem, decomposes the sources of error, and demonstrates through illustrative and real-data examples how naïve use of predicted outcomes distorts estimands and uncertainty. The authors survey a rapidly expanding set of assumption-lean IPD methods (e.g., PostPI, PPI, PSPA, RePPI) that calibrate predictions using a small gold-standard labeled set to achieve valid inference, often approaching semiparametric efficiency. They also discuss practical considerations, limitations, and future directions, emphasizing transparent reporting, design choices, and robust evaluation as predictions increasingly augment scientific data. Overall, IPD provides a principled bridge between AI-generated predictions and reliable statistical inference, enabling broader, more cost-effective, and more trustworthy scientific analyses across disciplines.
Abstract
As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g., rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as substitutes for missing or unobserved data. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to drawing inference with predicted data (IPD) and show that high predictive accuracy does not guarantee valid downstream inference. We show that all such failures reduce to statistical notions of (i) bias, when predictions systematically shift the estimand or distort relationships among variables, and (ii) variance, when uncertainty from the prediction model and the intrinsic variability of the true data are ignored. We then review recent methods for conducting IPD and discuss how this framework is deeply rooted in classical statistical theory. Finally, we comment on open questions and interesting avenues for future work in this area, and end with some comments on how to use predicted data in scientific studies in a way that is both transparent and statistically principled.
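To make the bias and variance points concrete, the following is a minimal simulation sketch for the simplest estimand, a population mean. It is not taken from the paper: the data-generating model, the hypothetical "pre-trained" predictor (which tracks the outcome closely but is shifted by a constant), the sample sizes, and the function names `naive_mean`, `rectified_mean`, and `simulate` are all assumptions made for illustration. The correction shown is the standard prediction-powered form for a mean, in which the average prediction on unlabeled data is adjusted by the average residual on a small labeled set.

```python
import numpy as np


def naive_mean(preds_unlabeled):
    """Treat predictions as if they were observed outcomes."""
    preds_unlabeled = np.asarray(preds_unlabeled, dtype=float)
    est = preds_unlabeled.mean()
    se = preds_unlabeled.std(ddof=1) / np.sqrt(preds_unlabeled.size)
    return est, se


def rectified_mean(y_labeled, preds_labeled, preds_unlabeled):
    """Prediction-powered style estimate of a mean.

    Average prediction on the unlabeled data, plus the average
    residual (outcome minus prediction) on the labeled data.
    """
    y_labeled = np.asarray(y_labeled, dtype=float)
    preds_labeled = np.asarray(preds_labeled, dtype=float)
    preds_unlabeled = np.asarray(preds_unlabeled, dtype=float)

    residuals = y_labeled - preds_labeled
    est = preds_unlabeled.mean() + residuals.mean()
    # The two terms come from independent samples, so variances add.
    se = np.sqrt(
        preds_unlabeled.var(ddof=1) / preds_unlabeled.size
        + residuals.var(ddof=1) / residuals.size
    )
    return est, se


def simulate(seed=0, n_labeled=200, n_unlabeled=10_000):
    """Assumed toy setup: Y = 2 + X + noise, predictor f(X) = 0.5 + X."""
    rng = np.random.default_rng(seed)
    true_mean = 2.0

    def draw(n):
        x = rng.normal(size=n)
        y = true_mean + x + rng.normal(size=n)
        preds = 0.5 + x  # hypothetical predictor, biased downward by 1.5
        return y, preds

    y_lab, f_lab = draw(n_labeled)
    _, f_unlab = draw(n_unlabeled)  # true outcomes unobserved here

    return {
        "true_mean": true_mean,
        "naive": naive_mean(f_unlab),
        "rectified": rectified_mean(y_lab, f_lab, f_unlab),
    }


if __name__ == "__main__":
    result = simulate()
    for name in ("naive", "rectified"):
        est, se = result[name]
        print(f"{name:>9}: {est:.3f} +/- {1.96 * se:.3f}")
```

Under these assumptions, the naive interval is narrow but centered far from the true mean, since it inherits the predictor's bias and ignores the variability of the true outcome. The rectified interval is wider, because it accounts for the residual variation seen in the labeled set, and it is centered near the truth.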
