DeepLINK-T: deep learning inference for time series data using knockoffs and LSTM

Wenxuan Zuo; Zifan Zhu; Yuxuan Du; Yi-Chun Yeh; Jed A. Fuhrman; Jinchi Lv; Yingying Fan; Fengzhu Sun

TL;DR

DeepLINK-T tackles FDR-controlled feature selection in high-dimensional longitudinal time series by coupling an LSTM-based autoencoder, which generates time-series knockoffs, with an LSTM predictor that takes both original and knockoff features as input. By deriving knockoff statistics $W_j=Z_j^2-\tilde{Z}_j^2$ through a plug-in pairwise-coupling framework and applying model-X knockoffs theory, it achieves finite-sample FDR control under complex temporal dependencies modeled by latent factors. Simulation studies show robust FDR control and superior power relative to non-time-series methods, and applications to real metagenomic data demonstrate biologically plausible, reproducible feature identification across multiple datasets and taxonomic levels. The work highlights the practical utility of integrating deep learning with knockoffs for longitudinal data and suggests avenues for improvement, including Transformer-based architectures and end-to-end training. Overall, DeepLINK-T provides a principled, scalable approach to interpretable, reproducible variable selection in time-series regression problems with high-dimensional covariates.
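To make the selection step concrete, the following is a minimal sketch of how statistics of the form $W_j=Z_j^2-\tilde{Z}_j^2$ could be turned into a selected feature set. It assumes the knockoff+ threshold from the general model-X knockoffs framework; the summary above does not state which threshold variant DeepLINK-T uses. The function names and the toy importance values are hypothetical, chosen only for illustration.

```python
def knockoff_statistics(z, z_tilde):
    """Return W_j = Z_j^2 - Z~_j^2 for each feature j."""
    if len(z) != len(z_tilde):
        raise ValueError("z and z_tilde must have the same length")
    return [a * a - b * b for a, b in zip(z, z_tilde)]


def knockoff_plus_threshold(w, q):
    """Smallest t among the nonzero |W_j| with
    (1 + #{j: W_j <= -t}) / max(#{j: W_j >= t}, 1) <= q.
    Returns infinity when no candidate satisfies the bound.
    (Assumed variant: knockoff+.)"""
    candidates = sorted({abs(v) for v in w if v != 0})
    for t in candidates:
        estimated_false = 1 + sum(1 for v in w if v <= -t)
        num_selected = sum(1 for v in w if v >= t)
        if estimated_false / max(num_selected, 1) <= q:
            return t
    return float("inf")


def select_features(w, q):
    """Indices of features whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, v in enumerate(w) if v >= t]


# Toy importance scores (illustrative values only, not from the paper).
z = [3.0, 2.5, 2.0, 1.8, 1.5, 1.2, 0.1, 0.2]
z_tilde = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.3, 0.1]
w = knockoff_statistics(z, z_tilde)
print(select_features(w, q=0.2))  # [0, 1, 2, 3, 4, 5]
```

A large positive $W_j$ indicates that the original feature mattered more than its knockoff copy, while the negative values are used to estimate how many selected features are false discoveries. With few features and a small target level, the bound may never be met, in which case nothing is selected.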

Abstract

High-dimensional longitudinal time series data is prevalent across various real-world applications. Many such applications can be modeled as regression problems with high-dimensional time series covariates. Deep learning has been a popular and powerful tool for fitting these regression models. Yet, the development of interpretable and reproducible deep-learning models is challenging and remains underexplored. This study introduces a novel method, Deep Learning Inference using Knockoffs for Time series data (DeepLINK-T), focusing on the selection of significant time series variables in regression while controlling the false discovery rate (FDR) at a predetermined level. DeepLINK-T combines deep learning with knockoff inference to control FDR in feature selection for time series models, accommodating a wide variety of feature distributions. It addresses dependencies across time and features by leveraging a time-varying latent factor structure in time series covariates. Three key ingredients for DeepLINK-T are 1) a Long Short-Term Memory (LSTM) autoencoder for generating time series knockoff variables, 2) an LSTM prediction network using both original and knockoff variables, and 3) the application of the knockoffs framework for variable selection with FDR control. Extensive simulation studies have been conducted to evaluate DeepLINK-T's performance, showing its capability to control FDR effectively while demonstrating superior feature selection power for high-dimensional longitudinal time series data compared to its non-time series counterpart. DeepLINK-T is further applied to three metagenomic data sets, validating its practical utility and effectiveness, and underscoring its potential in real-world applications.
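The two quantities the simulation studies report, FDR and power, can be estimated as in the sketch below. It uses the standard definitions (false discovery proportion averaged over replicates, and the fraction of truly relevant features recovered); the abstract does not spell these out, so the definitions, the function names, and the example index sets are assumptions made for illustration.

```python
def fdp_and_power(selected, true_features):
    """False discovery proportion and power for one simulation replicate.

    selected:      indices chosen by the selection procedure
    true_features: indices of the truly relevant features
    """
    selected = set(selected)
    true_features = set(true_features)
    false_discoveries = len(selected - true_features)
    # Convention: FDP is 0 when nothing is selected.
    fdp = false_discoveries / max(len(selected), 1)
    if true_features:
        power = len(selected & true_features) / len(true_features)
    else:
        power = 0.0
    return fdp, power


def empirical_fdr_and_power(replicates):
    """Average FDP and power over (selected, true_features) pairs.
    The mean FDP is the empirical estimate of the FDR."""
    results = [fdp_and_power(s, t) for s, t in replicates]
    n = len(results)
    mean_fdp = sum(r[0] for r in results) / n
    mean_power = sum(r[1] for r in results) / n
    return mean_fdp, mean_power


# Hypothetical replicate: 4 features selected, 3 of them truly relevant.
print(fdp_and_power([0, 1, 2, 7], [0, 1, 2, 3, 4]))  # (0.25, 0.6)
```

A procedure controls the FDR at level $q$ when the average FDP across replicates stays at or below $q$; power is then compared among the procedures that meet this requirement.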

Paper Structure (16 sections, 26 equations, 8 figures, 3 tables)

This paper contains 16 sections, 26 equations, 8 figures, 3 tables.

Introduction
Method
Model setting
The model-X knockoffs framework
DeepLINK-T: a new deep learning inference method for time series data
Simulation studies
The impacts of hyperparameters and model misspecification on DeepLINK-T
Comparisons of DeepLINK-T and DeepLINK
The impacts of number of subjects on DeepLINK-T
Real data applications
Application to longitudinal gut microbiome data of early infants
Application to marine metagenomic time series data
Identifying primary chlorophyll-a producer
Identifying taxa significantly associated with prokaryotic production
Application to dietary glycans treatment time series data
...and 1 more sections

Figures (8)

Figure 1: The structure of the LSTM autoencoder.
Figure 2: The structure of the LSTM prediction network.
Figure 3: The structure of the LSTM cell. $x$ is the input of the LSTM cell. $f$, $i$, and $o$ represent the forget, input, and output gates, respectively. $\tilde{c}_t$ denotes the input activation. $h$ and $c$ are hidden state and cell state, respectively. Subscript $t$ indicates the time step.
Figure 4: The impacts of bottleneck dimensionality and training epochs on DeepLINK-T using the linear factor model. A and B are FDR and power under the setting of linear link function. C and D are FDR and power under the setting of nonlinear link function. The number of training epochs of both LSTM autoencoder and LSTM prediction network is specified on the x-axis. The bottleneck dimensionality of the LSTM autoencoder is specified on the y-axis. The pre-specified FDR level is $q=0.2$.
Figure 5: The impacts of bottleneck dimensionality and training epochs on DeepLINK-T using the logistic factor model. A and B are FDR and power under the setting of linear link function. C and D are FDR and power under the setting of nonlinear link function. The number of training epochs of both LSTM autoencoder and LSTM prediction network is specified on the x-axis. The bottleneck dimensionality of the LSTM autoencoder is specified on the y-axis. The pre-specified FDR level is $q=0.2$.
...and 3 more figures
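
For reference, the quantities named in the Figure 3 caption correspond to the standard LSTM cell update, written out below. The weight matrices $W_\cdot$, $U_\cdot$ and biases $b_\cdot$ are notation assumed here rather than taken from the paper; $\sigma$ is the logistic sigmoid and $\odot$ denotes elementwise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
```

The forget gate $f_t$ scales the previous cell state, the input gate $i_t$ scales the new candidate $\tilde{c}_t$, and the output gate $o_t$ determines how much of the updated cell state is exposed as the hidden state $h_t$.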

Theorems & Definitions (1)

Definition 1: 2018MXK
