U-learning for Prediction Inference via Combinatory Multi-Subsampling: With Applications to LASSO and Neural Networks

Zhe Fei, Yi Li

TL;DR

This paper tackles the challenge of valid prediction inference for high-dimensional and nonparametric learners by introducing U-learning, a combinatory multi-subsampling framework that treats ensemble predictions as generalized U-statistics and leverages the Hájek projection for model-free variance estimation. The authors develop CMS-based procedures for both Lasso and deep neural networks, proving asymptotic normality and consistent variance estimation to yield conditional, per-subject prediction intervals. Theoretical results are complemented by extensive numerical experiments and a real-data application to epigenetic aging clocks, demonstrating competitive accuracy and reliable uncertainty quantification across tissues. The approach enables principled, instance-specific confidence intervals for complex predictive models, with potential impact on personalized aging interventions and other high-dimensional prediction tasks.
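As a reading aid, the generalized U-statistic view and the Hájek projection mentioned above can be written out in standard textbook form. The notation below ($Z_i$, $r$, $h$, $h_1$, $\zeta_1$) is assumed for illustration and is not taken from the paper; it also assumes i.i.d. observations and a kernel that is symmetric in its arguments.

```latex
% Assumed notation (not the paper's): Z_i = (x_i, y_i), subsample size r,
% h(.; x_*) = prediction at a new point x_* from a learner trained on r observations.
U_n(x_*) = \binom{n}{r}^{-1} \sum_{1 \le i_1 < \cdots < i_r \le n} h(Z_{i_1}, \ldots, Z_{i_r}; x_*),
\qquad \theta(x_*) = \mathbb{E}\, h(Z_1, \ldots, Z_r; x_*).

% Hajek projection: best approximation of U_n by a sum of functions of single observations.
\mathring{U}_n(x_*) = \theta(x_*) + \frac{r}{n} \sum_{i=1}^{n} h_1(Z_i; x_*),
\qquad h_1(z; x_*) = \mathbb{E}\, h(z, Z_2, \ldots, Z_r; x_*) - \theta(x_*).

% Variance of the projection, which drives the width of the interval.
\operatorname{Var}\{\mathring{U}_n(x_*)\} = \frac{r^2}{n}\, \zeta_1(x_*),
\qquad \zeta_1(x_*) = \operatorname{Var}\{ h_1(Z_1; x_*) \}.
```

In practice the sum over all $\binom{n}{r}$ subsets is typically approximated by averaging over a finite number of randomly drawn subsamples.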

Abstract

Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns at numerous CpG (Cytosine-phosphate-Guanine) sites within their genome. However, making valid inferences on predicted epigenetic ages, or more broadly, on predictions derived from high-dimensional inputs, presents challenges. We introduce a novel U-learning approach via combinatory multi-subsampling for making ensemble predictions and constructing confidence intervals for predictions of continuous outcomes when traditional asymptotic methods are not applicable. More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics and invokes the Hájek projection for deriving the variances of predictions and constructing confidence intervals with valid conditional coverage probabilities. We apply our approach to two commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and illustrate the validity of inferences with extensive numerical studies. We have applied these methods to predict the DNA methylation age (DNAmAge) of patients with various health conditions, aiming to accurately characterize the aging process and potentially guide anti-aging interventions.
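The general idea of a subsampled ensemble with a per-point variance estimate can be sketched in a few lines. The sketch below is not the authors' procedure: it swaps in a ridge-regularized least-squares base learner for Lasso or a DNN, and it uses an infinitesimal-jackknife-style variance estimator with a finite-sample factor for subsampling without replacement. The function name `subsample_ensemble_predict` and the parameters `r`, `B`, and `ridge` are hypothetical.

```python
import numpy as np

def subsample_ensemble_predict(X, y, x_new, r, B, ridge=1e-3, seed=0):
    """Ensemble prediction at x_new from B size-r subsamples, plus a variance estimate.

    Illustrative only: the base learner is ridge-regularized least squares,
    standing in for Lasso or a neural network.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    preds = np.empty(B)          # prediction at x_new from each subsample
    counts = np.zeros((B, n))    # inclusion indicator of observation i in subsample b

    for b in range(B):
        idx = rng.choice(n, size=r, replace=False)  # subsample without replacement
        counts[b, idx] = 1.0
        Xb, yb = X[idx], y[idx]
        beta = np.linalg.solve(Xb.T @ Xb + ridge * np.eye(p), Xb.T @ yb)
        preds[b] = x_new @ beta

    y_hat = preds.mean()  # ensemble (incomplete U-statistic) prediction

    # Covariance, across subsamples, between each inclusion indicator and the prediction.
    cov = (counts - counts.mean(axis=0)).T @ (preds - y_hat) / B

    # Infinitesimal-jackknife-style variance with a factor for subsampling
    # without replacement (an assumption of this sketch).
    var_hat = (n - 1) / n * (n / (n - r)) ** 2 * np.sum(cov ** 2)
    return y_hat, var_hat
```

A normal-approximation interval is then `y_hat ± 1.96 * sqrt(var_hat)`. With a finite number of subsamples `B` this type of estimator tends to be biased upward, so `B` should be large relative to `n`.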

Paper Structure

This paper contains 10 sections, 7 theorems, 71 equations, 4 figures, 4 tables, 2 algorithms.

Key Result

Lemma 3.1

Under C1, and for $K_n > K_0$ as defined in C3, the prediction $\widetilde{y}_*^{\mathring{b}}$ based on (lasso1) satisfies

Figures (4)

  • Figure 1: Prediction and inference in simulation Example 1. (a, b) $n=500$; (c, d) $n=1000$. Left panels show the average CIP on the test samples; right panels show the prediction SE versus the empirical SD of all test samples.
  • Figure 2: Prediction intervals from U-learning (left) and conformal prediction (right) for the life expectancy data of Example 3. Prediction intervals that do not cover the truth are shown in black. The labeled countries in panel (a) are (from left to right): Malawi, Sierra Leone, Mozambique, India, Djibouti, Belarus, China, Bangladesh, Romania, Vanuatu, Belgium.
  • Figure 3: DNA age based on out-of-bag predictions and the prediction intervals in three clocks: a. blood samples; b. non-blood samples; c. all samples.
  • Figure 4: DNA age prediction intervals and CIP for all test samples based on two models.

Theorems & Definitions (11)

  • Lemma 3.1
  • Theorem 3.2
  • Corollary 1
  • Lemma 3.3
  • Theorem 3.4
  • Corollary 2
  • Lemma 1
  • proof
  • Proof of Theorem \ref{thm_normality}
  • Proof of Corollary \ref{thm_var}
  • ...and 1 more