Revisiting inference after prediction

Keshav Motwani; Daniela Witten

TL;DR

Prediction-based inference uses the pairs $(\hat{f}(Z), X)$ to assess the association between $Y$ and $X$ when $Y$ is costly to observe. The paper compares the proposals of Wang et al. (2020) and Angelopoulos et al. (2023) and shows that the latter targets the correct parameter $\beta^* = \mathrm{E}[XX^\top]^{-1}\mathrm{E}[XY]$, while the corrections of Wang et al. (2020) target a different quantity and can fail to control the Type 1 error rate or achieve nominal coverage for a general $\hat{f}$. Through simulations and a direct replication of the simulation study of Wang et al. (2020), the authors demonstrate that those corrections are often anticonservative and give poor coverage unless $\hat{f}$ is effectively perfect, whereas the debiasing approach of Angelopoulos et al. (2023) yields valid inference regardless of the quality of $\hat{f}$. In the extreme and unrealistic case $\hat{f}(z) = \mathrm{E}[Y \mid Z = z]$, all methods target $\beta^*$, which underscores that the corrections of Wang et al. (2020) align with $\beta^*$ only under strong assumptions; the results generalize beyond linear models. Overall, the work advocates debiasing-based prediction inference for reliable semi-supervised inference across diverse prediction settings and provides code to reproduce the results.

Abstract

Recent work has focused on the very common practice of prediction-based inference: that is, (i) using a pre-trained machine learning model to predict an unobserved response variable, and then (ii) conducting inference on the association between that predicted response and some covariates. As pointed out by Wang et al. (2020), applying a standard inferential approach in (ii) does not accurately quantify the association between the unobserved (as opposed to the predicted) response and the covariates. In recent work, Wang et al. (2020) and Angelopoulos et al. (2023) propose corrections to step (ii) in order to enable valid inference on the association between the unobserved response and the covariates. Here, we show that the method proposed by Angelopoulos et al. (2023) successfully controls the type 1 error rate and provides confidence intervals with correct nominal coverage, regardless of the quality of the pre-trained machine learning model used to predict the unobserved response. However, the method proposed by Wang et al. (2020) provides valid inference only under very strong conditions that rarely hold in practice: for instance, if the machine learning model perfectly estimates the true regression function in the study population of interest.
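The two-step practice described above, and the kind of correction that remains valid for an imperfect prediction model, can be illustrated numerically. The following is a minimal sketch, not the authors' code: the data-generating process, the deliberately imperfect predictor `f_hat`, and the helper `ols` are all hypothetical, and the debiased estimate is formed by subtracting a labeled-data estimate of the prediction bias (a "rectifier") from the naive estimate, in the spirit of Angelopoulos et al. (2023).

```python
import numpy as np

def ols(x, y):
    """Least-squares coefficients of y on [1, x] (intercept first, slope second)."""
    design = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n_lab, n_unlab = 2_000, 20_000

def simulate(n):
    # Hypothetical data-generating process: Y = X + noise, so the true slope is 1.
    x = rng.normal(size=n)
    z = x + rng.normal(size=n)   # feature seen by the prediction model
    y = x + rng.normal(size=n)
    return x, z, y

def f_hat(z):
    # Deliberately imperfect stand-in for a pre-trained machine learning model.
    return 0.5 * z

x_lab, z_lab, y_lab = simulate(n_lab)
x_unlab, z_unlab, _ = simulate(n_unlab)   # the response is treated as unobserved

# Naive approach: regress the predicted response on the covariate.
naive = ols(x_unlab, f_hat(z_unlab))

# Rectifier: regress the prediction error on the covariate in the labeled data.
rectifier = ols(x_lab, f_hat(z_lab) - y_lab)

# Debiased estimate: naive estimate minus the estimated bias.
debiased = naive - rectifier

print("naive slope:   ", round(float(naive[1]), 3))
print("debiased slope:", round(float(debiased[1]), 3))
```

In this assumed setup the naive slope should concentrate near 0.5, because `f_hat` only captures half of the dependence on the covariate, while the debiased slope should concentrate near the true value of 1.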

Paper Structure (10 sections, 1 theorem, 23 equations, 6 figures, 1 algorithm)

Introduction
What parameter is each method targeting?
The general case for an arbitrary prediction model $\hat{f}$
An extreme setting where all methods target the correct quantity
An empirical investigation of the distribution of the test statistic
A direct replication of the simulation study of Wang et al. (2020)
Discussion
Necessity of consistency for $\beta^*$
Lack of consistency of the analytical method of Wang et al. (2020)
Inferential consequences of wrong distribution

Key Result

Lemma 1

Suppose $\widehat{\operatorname{SE}}(\hat{\beta}_j) = o_p(1)$ and $\hat{\beta}_j \overset{p}{\not\to} \beta^*_j$. Then $(\hat{\beta}_j - \beta^*_j)/\widehat{\operatorname{SE}}(\hat{\beta}_j)$ does not converge in distribution.
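
The intuition can be sketched under an additional simplifying assumption that the lemma itself does not require: that $\hat{\beta}_j$ converges in probability to some limit $\tilde{\beta}_j \neq \beta^*_j$. The numerator of the statistic then settles at a nonzero constant while the denominator vanishes, so the statistic escapes every bounded set. The symbol $\tilde{\beta}_j$ and the constant $M$ are introduced only for this illustration.

```latex
\hat{\beta}_j - \beta^*_j \overset{p}{\to} \tilde{\beta}_j - \beta^*_j \neq 0,
\qquad
\widehat{\operatorname{SE}}(\hat{\beta}_j) \overset{p}{\to} 0
\quad\Longrightarrow\quad
\Pr\!\left( \left| \frac{\hat{\beta}_j - \beta^*_j}{\widehat{\operatorname{SE}}(\hat{\beta}_j)} \right| > M \right) \to 1
\quad \text{for every fixed } M > 0 .
```

A sequence with this property is not tight, so it cannot converge in distribution; in particular, comparing the statistic to $N(0,1)$ quantiles is not justified.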

Figures (6)

Figure 1: An examination of the distribution of $\hat{\beta}_1 / \widehat{\operatorname{SE}}(\hat{\beta}_1)$ under $H_0: \beta_1^* = 0$. For each of four different prediction models $\hat{f}(\cdot)$ (three trained GAMs and the true regression function), we display the empirical distribution of $\hat{\beta}_1 / \widehat{\operatorname{SE}}(\hat{\beta}_1)$ as the sample sizes increase, with $n_{\text{lab}} = 0.1 n_{\text{unlab}}$. The $N(0,1)$ distribution is shown in black. The dashed black lines show the $0.025$ and $0.975$ quantiles of this distribution. The distributions of the test statistics of Wang et al. (2020) increasingly diverge from the $N(0,1)$ distribution as the sample sizes increase. The methods and simulation setup are described in Section 3.
Figure 2: An examination of the distribution of $(\hat{\beta}_1 - \beta_1^*) / \widehat{\operatorname{SE}}(\hat{\beta}_1)$ when $\beta_1^* = 1$. For each of four different prediction models $\hat{f}(\cdot)$ (three trained GAMs and the true regression function), we display the empirical distribution of $(\hat{\beta}_1 - \beta_1^*) / \widehat{\operatorname{SE}}(\hat{\beta}_1)$ as the sample sizes increase, with $n_{\text{lab}} = 0.1 n_{\text{unlab}}$. The $N(0,1)$ distribution is shown in black. The dashed black lines show the $0.025$ and $0.975$ quantiles of this distribution. The distributions of the test statistics of Wang et al. (2020) increasingly diverge from the $N(0,1)$ distribution as the sample sizes increase. The methods and simulation setup are described in Section 3.
Figure 3: For data generated under $H_0: \beta_1^* = 0$, quantile-quantile plots of the p-values across simulation replicates are displayed. The methods are described in Section 3 and the simulation setup is described in Section 4. Each panel corresponds to different sample sizes of the labeled and unlabeled datasets used for inference. The bootstrap and analytical corrections considered by Wang et al. (2020) become increasingly anticonservative as the sample sizes increase. The classical approach and that of Angelopoulos et al. (2023) are well-calibrated.
Figure 4: For data generated with $\beta_1^* = 1$, empirical coverage of 95% confidence intervals for each method across simulation replicates, as the labeled and unlabeled sample sizes increase, with $n_{\text{lab}} = 0.1 n_{\text{unlab}}$. The methods are described in Section 3 and the simulation setup is described in Section 4.
Figure 5: For labeled and unlabeled datasets generated under $H_0: \beta_1^* = 0$, quantile-quantile plots of the p-values across replicates of the modified simulation study are displayed for each of the four $\hat{f}(\cdot)$'s considered. The methods and simulation setup are described in Section \ref{subsec:se}. Each panel corresponds to different sample sizes of the labeled and unlabeled datasets used for inference. The naive method and the bootstrap and analytical corrections considered by Wang et al. (2020) become increasingly anticonservative as the sample sizes increase, unless the machine learning model is perfect, i.e., $\hat{f}(z) = \operatorname{E}[Y \mid Z = z]$.
...and 1 more figure
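
The calibration pattern that the captions describe under $H_0: \beta_1^* = 0$ can be reproduced qualitatively in a toy Monte Carlo experiment. This is a hedged sketch and not the paper's simulation setup: the data-generating process, the prediction `f`, the sample sizes, and the helpers `slope_and_se` and `one_replicate` are hypothetical, the debiased estimator is a simple rectifier-style correction in the spirit of Angelopoulos et al. (2023), and classical homoskedastic standard errors are assumed for simplicity. The corrections of Wang et al. (2020) are not implemented here.

```python
import numpy as np

def slope_and_se(x, y):
    """OLS slope of y on [1, x] and its classical (homoskedastic) standard error."""
    design = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef[1], np.sqrt(cov[1, 1])

def one_replicate(rng, n_lab=200, n_unlab=2_000, z_crit=1.96):
    def draw(n):
        x = rng.normal(size=n)
        w = rng.normal(size=n)
        y = w + rng.normal(size=n)   # Y does not depend on X, so the true slope is 0
        f = w + 0.3 * x              # hypothetical imperfect prediction built from (x, w)
        return x, y, f

    x_l, y_l, f_l = draw(n_lab)
    x_u, _, f_u = draw(n_unlab)      # response unobserved in the unlabeled data

    # Naive: regress predictions on x in the unlabeled data.
    b_naive, se_naive = slope_and_se(x_u, f_u)

    # Debiased: subtract the labeled-data regression of the prediction error on x.
    b_rect, se_rect = slope_and_se(x_l, f_l - y_l)
    b_deb = b_naive - b_rect
    se_deb = np.sqrt(se_naive**2 + se_rect**2)   # the two samples are independent

    return abs(b_naive / se_naive) > z_crit, abs(b_deb / se_deb) > z_crit

rng = np.random.default_rng(1)
rejections = np.array([one_replicate(rng) for _ in range(500)])
naive_rate, debiased_rate = rejections.mean(axis=0)

print("naive rejection rate:   ", naive_rate)
print("debiased rejection rate:", debiased_rate)
```

Under these assumptions the naive test should reject the true null in nearly every replicate, while the debiased test should reject at a rate close to the nominal 5%.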

Theorems & Definitions (2)

Lemma 1
proof
