Neural interval-censored survival regression with feature selection

Carlos García Meixide; Marcos Matabuena; Louis Abraham; Michael R. Kosorok

TL;DR

The paper tackles interval-censored survival regression in high-dimensional settings by proposing a neural-network-based Accelerated Failure Time (AFT) framework. It combines a LassoNet-inspired sparse neural network for variable selection with a residual-network predictor for the interval-censored log-time, modeled as $\log T = r_{\theta,W}(Z) + \varepsilon$, where $\varepsilon \sim N(0,\sigma^2)$. Key contributions include stability-selection-based variable selection for interval censoring, a three-stage sample-splitting protocol for honest evaluation, and real-data demonstrations on Type 1 Diabetes Exchange and NHANES, along with open-source software. The results show improved predictive accuracy in nonlinear settings over traditional LogNormal AFT models, along with interpretable variable selections and practical guidance for researchers facing interval-censored, high-dimensional biomedical data.
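As a minimal sketch of what the model above implies for fitting: when the event time is only known to lie in an interval $(L, R]$, a log-normal AFT model contributes $\Phi((\log R - \mu)/\sigma) - \Phi((\log L - \mu)/\sigma)$ to the likelihood, where $\mu$ stands in for the network output $r_{\theta,W}(Z)$. The function names below are hypothetical, and this is the textbook interval-censored likelihood under the stated Gaussian-noise assumption, not necessarily the exact loss used by the authors.

```python
import math


def normal_cdf(x):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def interval_censored_nll(mu, sigma, left, right):
    """Negative log-likelihood of a log-normal AFT model under interval censoring.

    mu          : predicted log-times, one per subject (stand-in for r(Z_i))
    sigma       : noise standard deviation, assumed > 0
    left, right : interval endpoints with 0 <= left < right;
                  right may be math.inf (right-censored observation)
    """
    nll = 0.0
    for m, l, r in zip(mu, left, right):
        # P(T <= t) = Phi((log t - m) / sigma); t = 0 and t = inf are handled explicitly.
        upper = 1.0 if math.isinf(r) else normal_cdf((math.log(r) - m) / sigma)
        lower = 0.0 if l <= 0.0 else normal_cdf((math.log(l) - m) / sigma)
        prob = max(upper - lower, 1e-300)  # guard against log(0)
        nll -= math.log(prob)
    return nll


if __name__ == "__main__":
    # Two subjects: one observed in (0.5, 2.0], one right-censored at 2.0.
    print(interval_censored_nll([0.0, 1.0], 1.0, [0.5, 2.0], [2.0, math.inf]))
```

In a neural setting, `mu` would be produced by the network and this quantity minimized over the weights and `sigma` with an automatic-differentiation library; the pure-Python loop here is only meant to make the formula concrete.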

Abstract

Survival analysis is a fundamental area of focus in biomedical research, particularly in the context of personalized medicine. This prominence is due to the increasing prevalence of large and high-dimensional datasets, such as omics and medical imaging data. However, the literature on non-linear regression algorithms and variable selection techniques for interval censoring is either limited or non-existent, particularly in the context of neural networks. Our objective is to introduce a novel predictive framework tailored for interval-censored regression tasks, rooted in Accelerated Failure Time (AFT) models. Our strategy comprises two key components: i) a variable selection phase leveraging recent advances in sparse neural network architectures, and ii) a regression model targeting prediction of the interval-censored response. To assess the performance of our novel algorithm, we conducted a comprehensive evaluation through both numerical experiments and real-world applications that encompass scenarios related to diabetes and physical activity. Our method outperforms traditional AFT algorithms, particularly in scenarios featuring non-linear relationships.

Paper Structure (17 sections, 14 equations, 8 figures, 4 tables, 1 algorithm)

Introduction
The difficulty of statistical inference with interval-censored data
Summary of contributions
Structure of the paper
Methods
"Case-2" interval-censoring
Variable selection
Sparse neural networks
Stability selection
Predictive model
Sample-splitting
Simulated experiments
Generation of interval-censored targets
Description of the simulation setups and results
Practical considerations
...and 2 more sections

Figures (8)

Figure 1: The stability selection technique applied to synthetic data: example 1 with $n=1000$.
Figure 2: Example diagram of a feedforward neural network with a skip layer. The architecture used in the application to NHANES data would be the same, adding an extra layer with the same number of hidden units (10). The total number of parameters, in that case, is $604$. Bias terms are omitted in the drawing.
Figure 3: Graphical deliverable of Section \ref{wrap}. For a desired expected proportion of removed variables $x$ and a false discovery rate $F$, both fixed by the user, the variables eventually entering the model are those whose frequency curve $\pi_j$ surpasses the dotted black line given by $F$ at $x$.
Figure 4: Predicted regression surface of our neural AFT model when all covariates are set to their baseline values except A1c and received units of insulin. Each plot corresponds to setting the treatment indicator to 0 or 1, with the arm labeled in the plot title. The minimum and maximum values of these two covariates are encoded in the plot as 0 and 1, respectively. The figure illustrates the general idea that, if certain assumptions are met, neural networks can capture complex, non-linear interactions between features without explicitly modeling them. The logarithm of survival time as a function of these two variables is already highly non-linear, and once a third, categorical variable comes into play the regression surface is not modified by a simple shift. For the control arm, it is harder to reduce the risk of albuminuria concentration disorder by providing insulin to patients with high values of A1c.
Figure 5: Marginal Turnbull estimators stratified by HbA1c values. The blue curve covers observations with HbA1c $<8$ and the red curve covers observations with HbA1c $\geq 8$.
...and 3 more figures
