Assumption-Lean and Data-Adaptive Post-Prediction Inference

Jiacheng Miao, Xinran Miao, Yixuan Wu, Jiwei Zhao, Qiongshi Lu

TL;DR

This work introduces PoSt-Prediction Adaptive inference (PSPA) that allows valid and powerful inference based on ML-predicted data, and guarantees reliable statistical inference without assumptions on the ML prediction.

Abstract

A primary challenge facing modern scientific research is the limited availability of gold-standard data which can be costly, labor-intensive, or invasive to obtain. With the rapid development of machine learning (ML), scientists can now employ ML algorithms to predict gold-standard outcomes with variables that are easier to obtain. However, these predicted outcomes are often used directly in subsequent statistical analyses, ignoring imprecision and heterogeneity introduced by the prediction procedure. This will likely result in false positive findings and invalid scientific conclusions. In this work, we introduce PoSt-Prediction Adaptive inference (PSPA) that allows valid and powerful inference based on ML-predicted data. Its "assumption-lean" property guarantees reliable statistical inference without assumptions on the ML prediction. Its "data-adaptive" feature guarantees an efficiency gain over existing methods, regardless of the accuracy of ML prediction. We demonstrate the statistical superiority and broad applicability of our method through simulations and real-data applications.
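The "data-adaptive" idea can be illustrated in the simplest setting, estimating a mean from a small labeled sample and a large unlabeled sample with ML predictions. The sketch below is not the PSPA estimator from the paper. It is a generic control-variate combination with a single weight chosen from the data, shown only to make the intuition concrete. The function name `adaptive_mean`, its arguments, and the weight formula are assumptions of this illustration. When the predictions carry no information, the weight shrinks to zero and the estimate falls back to the classical labeled-data mean.

```python
import random
import statistics


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def adaptive_mean(y_lab, f_lab, f_unlab):
    """Weighted prediction-augmented estimate of E[Y] (illustrative, hypothetical helper).

    y_lab:   list of gold-standard outcomes on the n labeled samples
    f_lab:   list of ML predictions on the same n labeled samples
    f_unlab: list of ML predictions on the N unlabeled samples
    Returns (estimate, estimated standard error, weight).
    """
    n, N = len(y_lab), len(f_unlab)
    var_f = statistics.variance(f_lab + f_unlab)  # pooled variance of predictions
    cov_yf = covariance(y_lab, f_lab)
    # Weight that minimizes the estimated variance of
    # mean(y_lab) + w * (mean(f_unlab) - mean(f_lab)).
    w = cov_yf / ((1 + n / N) * var_f) if var_f > 0 else 0.0
    est = statistics.fmean(y_lab) + w * (
        statistics.fmean(f_unlab) - statistics.fmean(f_lab)
    )
    var_est = (statistics.variance(y_lab) - 2 * w * cov_yf) / n + w**2 * var_f * (
        1 / n + 1 / N
    )
    return est, max(var_est, 0.0) ** 0.5, w


if __name__ == "__main__":
    random.seed(1)
    y = [random.gauss(0.0, 1.0) for _ in range(100)]
    f = [v + random.gauss(0.0, 0.5) for v in y]  # noisy predictions of y
    f_new = [random.gauss(0.0, 1.0) + random.gauss(0.0, 0.5) for _ in range(2000)]
    estimate, se, weight = adaptive_mean(y, f, f_new)
    classical_se = (statistics.variance(y) / len(y)) ** 0.5
    print(f"estimate={estimate:.3f}, se={se:.3f}, classical se={classical_se:.3f}, w={weight:.3f}")
```

With the plug-in weight used here, the estimated variance simplifies to the classical variance minus a non-negative term, so the reported standard error never exceeds the classical one in this toy setting, whatever the quality of the predictions.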

Paper Structure (32 sections, 10 theorems, 58 equations, 6 figures, 2 tables, 2 algorithms)

This paper contains 32 sections, 10 theorems, 58 equations, 6 figures, 2 tables, 2 algorithms.

Key Result

Theorem 1

Under Conditions (C1)-(C4), and assuming $\frac{n}{N} \rightarrow \rho$ as $n \to \infty$ and $N \rightarrow \infty$, the proposed estimator $\widehat{{\boldsymbol{\theta}}}_{\textnormal{PSPA}}({{\boldsymbol\omega}})$ converges to ${\boldsymbol{\theta}}$ in probability. Assuming additionally Conditio where ${\boldsymbol{\Sigma}}({\boldsymbol\omega}) = {\bf A}^{-1}{\bf V}({\boldsymbol\omega}){\bf A}

Figures (6)

  • Figure 1: Comparison of PPI, PSPA, and the classical method in identifying sex-biased gene expressions using GTEx data. (a) Number of sex-biased genes identified by each of the four approaches. (b) x-axis: absolute value of imputation correlation; y-axis: relative ratio of estimated standard error between PPI and the classical method. (c) Same as (b) but for our method, PSPA. The dashed line represents $y = 1$ in (b)-(c).
  • Figure 2: Coverage of the confidence interval and relative ratio of its width compared to the classical method for linear and logistic regression. ML is used to predict the labels. Panels (a)-(d) show the coverage of the confidence interval. Panels (e)-(h) show the relative ratio of the width of the confidence interval in comparison with the classical method. Panels (a), (b), (e), and (f) correspond to settings with varying sample sizes of unlabeled data. Panels (c), (d), (g), and (h) correspond to settings with different levels of imputation accuracy. The dashed lines represent $y = 0.95$ in (a)-(d) and $y=1$ in (e)-(h).
  • Figure 3: Coverage of the confidence interval and relative ratio of its width compared to the classical method for linear and logistic regression. ML is used to predict the covariates. Panels (a)-(d) show the coverage of the confidence interval. Panels (e)-(h) show the relative ratio of the width of the confidence interval in comparison with the classical method. Panels (a), (b), (e), and (f) correspond to settings with varying sample sizes of unlabeled data. Panels (c), (d), (g), and (h) correspond to settings with different levels of imputation accuracy. The dashed lines represent $y = 0.95$ in (a)-(d) and $y=1$ in (e)-(h).
  • Figure 4: Comparison of PSPA, classical, PPI, and EIF$^{*}$-based approaches in identifying sex-biased gene expressions using GTEx data. Each panel illustrates a different aspect of comparison on the y- and x- axes: point estimates between the (a) classical and PPI approaches; (b) classical and EIF$^{*}$-based approaches; (c) classical and PSPA approaches; estimated standard errors between the (d) classical and PPI approaches; (e) classical and EIF$^{*}$-based approaches; (f) classical and PSPA approaches; (g) number of sex-biased genes identified by each of the four approaches. The dashed lines represent $y = x$ in (a)-(c) and $y=1$ in (d)-(f).
  • Figure 5: Comparison of PPI++ and PSPA in simulation. This figure shows the coverage of the confidence interval and the relative ratio of the width of the confidence interval compared to the classical method for linear and logistic regression. ML is used to predict the labels. Panels (a)-(d) show the coverage of the confidence interval. Panels (e)-(h) show the relative ratio of the width of the confidence interval in comparison with the classical method. Panels (a), (b), (e), and (f) correspond to settings with varying sample sizes of unlabeled data. Panels (c), (d), (g), and (h) correspond to settings with different levels of imputation accuracy. The dashed lines represent $y = 0.95$ in (a)-(d) and $y=1$ in (e)-(h).
  • ...and 1 more figure

Theorems & Definitions (16)

  • Example 1: Sex-differentiated gene expressions
  • Remark 1: The consideration of ${\bf X}$ and ${\bf Z}$
  • Remark 2: Data-Adaptive Feature
  • Theorem 1
  • Corollary 1
  • Corollary 2
  • Proposition 1
  • Example 2: Linear Regression
  • Example 3: Logistic Regression
  • Proposition 2: Efficient Influence Function
  • ...and 6 more