Do We Really Even Need Data? A Modern Look at Drawing Inference with Predicted Data

Stephen Salerno, Kentaro Hoffman, Awan Afiaz, Anna Neufeld, Tyler H. McCormick, Jeffrey T. Leek

TL;DR

This paper analyzes the statistical hazards of drawing inference from predicted data (IPD) and shows that high predictive accuracy does not guarantee valid downstream conclusions due to bias propagation and variance underestimation. It formalizes the IPD problem, decomposes error sources, and demonstrates through illustrative and real-case examples how naïve use of predicted outcomes distorts estimands and uncertainty. The authors survey a rapidly expanding set of assumption-lean IPD methods (e.g., PostPI, PPI, PSPA, RePPI) that calibrate predictions using a small gold-standard labeled set to achieve valid inference, often approaching semiparametric efficiency. They also discuss practical considerations, limitations, and future directions, emphasizing transparent reporting, design choices, and robust evaluation as predictions increasingly augment scientific data. Overall, IPD provides a principled bridge between AI-generated predictions and reliable statistical inference, enabling broader, more cost-effective, and trustworthy scientific analyses across disciplines.

Abstract

As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g., rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as substitutes for missing or unobserved data. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to drawing inference with predicted data (IPD) and show that high predictive accuracy does not guarantee valid downstream inference. We show that all such failures reduce to statistical notions of (i) bias, when predictions systematically shift the estimand or distort relationships among variables, and (ii) variance, when uncertainty from the prediction model and the intrinsic variability of the true data are ignored. We then review recent methods for conducting IPD and discuss how this framework is deeply rooted in classical statistical theory. Finally, we comment on open questions and avenues for future work in this area, and close with comments on how to use predicted data in scientific studies in a way that is both transparent and statistically principled.
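
The bias and variance failures described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: the data-generating process, the prediction rule `y_hat`, the labeled-set size, and the function names are all hypothetical choices made for illustration. It regresses a predicted outcome on a covariate the prediction rule never used, then applies a simple correction in the spirit of prediction-powered inference for ordinary least squares (the naive slope on unlabeled data plus the slope of the prediction residual on a small labeled set).

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a simple linear regression of y on x (with intercept)."""
    x_c = x - x.mean()
    return float(x_c @ (y - y.mean()) / (x_c @ x_c))


def simulate(n=5000, n_labeled=500, seed=0):
    rng = np.random.default_rng(seed)

    # Hypothetical data-generating process: Y depends on Z1 and Z2.
    z1 = rng.normal(size=n)
    z2 = rng.normal(size=n)
    y = 1.0 * z1 + 1.0 * z2 + rng.normal(scale=0.5, size=n)

    # Hypothetical pre-trained rule that uses only Z2 (it never saw Z1).
    y_hat = 1.0 * z2

    # Assume the true Y is observed only for the first n_labeled rows.
    lab = np.arange(n) < n_labeled
    unl = ~lab

    # Oracle slope: what we would estimate if Y were observed everywhere.
    true_slope = ols_slope(z1, y)

    # Naive slope: treat the prediction as if it were the outcome.
    naive_slope = ols_slope(z1[unl], y_hat[unl])

    # Correction: add the slope of the residual (Y - Y_hat) on Z1,
    # estimated on the labeled rows only. OLS is linear in the outcome,
    # so this equals slope(Y) - slope(Y_hat) on the labeled set.
    rectifier = ols_slope(z1[lab], y[lab] - y_hat[lab])
    corrected_slope = naive_slope + rectifier

    return {
        "true_slope": true_slope,
        "naive_slope": naive_slope,
        "corrected_slope": corrected_slope,
        "var_true": float(y.var()),
        "var_pred": float(y_hat.var()),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, the naive slope sits near zero even though the true slope is one, and the predicted outcome is less variable than the true outcome, so naive standard errors would also be too small. The corrected slope recovers the target using only the small labeled subset to calibrate the predictions. A full analysis would also need a valid variance estimate for the corrected slope, which this sketch omits.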

Paper Structure

This paper contains 25 sections, 22 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Artist renderings of a rhinoceros based on limited information. Left: Albrecht Dürer's The Rhinoceros, woodcut (1515); Right: C.M. Kösemen's paleoart reconstruction of a rhinoceros based on its skeleton (2012).
  • Figure 2: Overview of the setup for inference with predicted data.
  • Figure 3: Illustrative example of bias and variance when regressing a predicted outcome on the predictor of interest. Each panel plots the outcome vs. $Z_1$: gray points are the true $Y$, blue points are the predicted $\hat{Y}$ from rules trained on an independent sample. The black dashed line is the true slope, the blue line and ribbon are the fitted line and its standard error from regressing $\hat{Y}$ on $Z_1$. Prediction rules use different feature sets: (A) all ten features, (B) $Z_1, Z_2$, (C) $Z_2, \ldots, Z_{10}$ (excluding $Z_1$), (D) $Z_2, Z_3$ (excluding $Z_1$).
  • Figure 4: DXA percent body fat vs. two proxy measures. Shaded regions within dashed lines mark obesity thresholds for each measure, showing misclassification under either proxy measure (top-left and bottom-right regions of each panel) and the overlap between them (top-right) for BMI (kg/m$^2$, left) and waist circumference (cm, right). This misclassification differs for females (top, yellow) versus males (bottom, blue).
  • Figure 5: Coefficient (log-odds) estimates and 95% confidence intervals for logistic regressions of obesity on certain demographic risk factors (age, sex, and race). Obesity is a binary outcome defined based on pre-specified thresholds for three continuous measures of adiposity: dual-energy X-ray absorptiometry (DXA)-based adiposity % body fat (green), body mass index (BMI; kg/m$^2$, blue), and waist circumference (WC; cm, yellow).
  • ...and 4 more figures