Task-Agnostic Machine-Learning-Assisted Inference

Jiacheng Miao; Qiongshi Lu

TL;DR

PSPS is a statistical framework for task-agnostic ML-assisted inference. It delivers valid and efficient inference that is robust to arbitrary choice of ML model, so nearly all existing statistical frameworks can be applied to the analysis of ML-predicted data.

Abstract

Machine learning (ML) is playing an increasingly important role in scientific research. In conjunction with classical statistical approaches, ML-assisted analytical strategies have shown great promise in accelerating research findings. This has also opened a whole field of methodological research focusing on integrative approaches that leverage both ML and statistics to tackle data science challenges. One type of study that has quickly gained popularity employs ML to predict unobserved outcomes in massive samples and then uses the predicted outcomes in downstream statistical inference. However, existing methods designed to ensure the validity of this type of post-prediction inference are limited to very basic tasks such as linear regression analysis. This is because extending these approaches to new, more sophisticated statistical tasks requires task-specific algebraic derivations and software implementations, which ignores the massive library of existing software tools already developed for the same scientific problem given observed data. This severely constrains the scope of application for post-prediction inference. To address this challenge, we introduce a novel statistical framework named PSPS for task-agnostic ML-assisted inference. It provides a post-prediction inference solution that can be easily plugged into almost any established data analysis routine. It delivers valid and efficient inference that is robust to arbitrary choice of ML model, allowing nearly all existing statistical frameworks to be incorporated into the analysis of ML-predicted data. Through extensive experiments, we showcase our method's validity, versatility, and superiority compared to existing approaches. Our software is available at https://github.com/qlu-lab/psps.

Paper Structure (28 sections, 3 theorems, 25 equations, 5 figures, 2 tables, 3 algorithms)

Introduction
Problem formulations
Setting
Building the intuition with mean estimation
Related work
Methods
General protocol for PSPS
Theoretical guarantees
Extensions
Labeled data and unlabeled data are not independent
Sensitivity analysis for distributional shift between labeled and unlabeled data
ML-assisted FDR control
ML-assisted inference with predicted features
Numerical experiments and real data application
Simulations
...and 13 more sections

Key Result

Theorem 1

Assuming equation eq:esteq holds, then $\widehat{{\boldsymbol{\theta}}}_\textnormal{PSPS} \stackrel{P}{\rightarrow} {\boldsymbol{\theta}}^*$, and where ${\bf V}(\widehat{{\boldsymbol{\theta}}}_\textnormal{PSPS}) = {\bf V}(\widehat{{\boldsymbol{\theta}}}_{\cal L}) - {\bf V}(\widehat{{\boldsymbol{\theta}}}_{\cal L}, \widehat{\boldsymbol\eta}_{\cal L})^{\rm T}( {\bf V}(\widehat{\boldsymbol\eta}_{\ca
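
The visible part of Theorem 1 writes the variance of the PSPS estimator as the variance of the labeled-only estimator minus a correction term. A minimal sketch of that general idea for mean estimation is given below, using a generic control-variate estimator. It is an illustration under stated assumptions, not the PSPS algorithm from the paper and not the API of the `psps` package: the function name `ml_assisted_mean`, its arguments, and the weight formula are hypothetical choices made here, and the labeled and unlabeled samples are assumed to be independent draws from the same distribution.

```python
import numpy as np


def ml_assisted_mean(y_lab, f_lab, f_unlab):
    """Illustrative control-variate mean estimate (hypothetical helper).

    y_lab   : observed outcomes in the labeled sample (size n)
    f_lab   : ML predictions for the labeled sample (size n)
    f_unlab : ML predictions for the unlabeled sample (size N)
    Returns (theta, se_assisted, se_labeled_only).
    """
    y_lab = np.asarray(y_lab, dtype=float)
    f_lab = np.asarray(f_lab, dtype=float)
    f_unlab = np.asarray(f_unlab, dtype=float)
    n, N = y_lab.size, f_unlab.size

    # Labeled-only estimate and its estimated variance.
    theta_lab = y_lab.mean()
    cov = np.cov(y_lab, f_lab, ddof=1)  # 2x2 sample covariance matrix
    var_y, var_f, cov_yf = cov[0, 0], cov[1, 1], cov[0, 1]
    var_lab = var_y / n

    # Weight minimizing Var(theta_lab + omega * (mean(f_unlab) - mean(f_lab)))
    # when the two samples are independent (assumption).
    omega = cov_yf / (var_f * (1.0 + n / N)) if var_f > 0 else 0.0

    # Shift the labeled-only estimate by the weighted gap between
    # prediction means in the unlabeled and labeled samples.
    theta = theta_lab + omega * (f_unlab.mean() - f_lab.mean())

    # At the optimal weight the variance is the labeled-only variance
    # minus a nonnegative correction.
    var_assisted = var_lab - omega * cov_yf / n
    return theta, np.sqrt(var_assisted), np.sqrt(var_lab)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_lab, x_unlab = rng.normal(size=500), rng.normal(size=20000)
    y_lab = 2.0 + x_lab + rng.normal(scale=0.5, size=500)
    # A deliberately imperfect "ML model": biased and mis-scaled.
    f_lab, f_unlab = 2.1 + 0.9 * x_lab, 2.1 + 0.9 * x_unlab
    print(ml_assisted_mean(y_lab, f_lab, f_unlab))
```

Because the predictions enter only through a difference of means with a data-driven weight, a biased or uninformative model shrinks the weight toward zero instead of biasing the estimate; in this sketch the assisted standard error never exceeds the labeled-only one.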

Figures (5)

Figure 1: Workflow of PSPS for Task-Agnostic ML-Assisted Inference.
Figure 2: Simulation for tasks that have been implemented for ML-assisted inference, including mean estimation, linear regression, and logistic regression from left to right. Panels a-c present confidence interval coverage and panels d-f present confidence interval width.
Figure 3: Simulation for tasks that have not been implemented for ML-assisted inference, including quantile regression, instrumental variable (IV) regression, negative binomial (NB) regression, debiased Lasso, and Wilcoxon rank-sum test from left to right. Panels a-e present confidence interval coverage (1 - type-I error for Wilcoxon rank-sum test) and panels f-j present confidence interval width (1 - power for Wilcoxon rank-sum test).
Figure E.1: Simulation for FDR control. Panels a-b show the estimated FDR level given the expected FDR. Panels c-d show the power.
Figure E.2: Manhattan plot of vQTLs for bone mineral density. The X-axis represents chromosomes (CHR), plotted by base pair positions (BP). Each point on the plot indicates a single nucleotide polymorphism (SNP). The Y-axis depicts -log10(p-values).

Theorems & Definitions (9)

Remark 1
Remark 2
Theorem 1
Proposition 1
Proposition 2
Remark 3
proof
proof
proof
