DeepLINK-T: deep learning inference for time series data using knockoffs and LSTM
Wenxuan Zuo, Zifan Zhu, Yuxuan Du, Yi-Chun Yeh, Jed A. Fuhrman, Jinchi Lv, Yingying Fan, Fengzhu Sun
TL;DR
DeepLINK-T tackles FDR-controlled feature selection in high-dimensional longitudinal time series by coupling an LSTM autoencoder, which generates time-series knockoffs, with an LSTM predictor trained on both original and knockoff features. By deriving knockoff statistics $W_j = Z_j^2 - \tilde{Z}_j^2$ through a plug-in pairwise-coupling framework and applying model-X knockoffs theory, it achieves finite-sample FDR control under complex temporal dependencies modeled by latent factors. Simulation studies show robust FDR control and superior power relative to non-time-series methods, and applications to real metagenomic data demonstrate biologically plausible, reproducible feature identification across multiple datasets and taxonomic levels. The work highlights the practical utility of integrating deep learning with knockoffs in longitudinal data and suggests avenues for improvement, including Transformer-based architectures and end-to-end training. Overall, DeepLINK-T provides a principled, scalable approach to interpretable, reproducible variable selection in time-series regression problems with high-dimensional covariates.
Abstract
High-dimensional longitudinal time series data is prevalent across various real-world applications. Many such applications can be modeled as regression problems with high-dimensional time series covariates. Deep learning has been a popular and powerful tool for fitting these regression models. Yet, the development of interpretable and reproducible deep-learning models is challenging and remains underexplored. This study introduces a novel method, Deep Learning Inference using Knockoffs for Time series data (DeepLINK-T), focusing on the selection of significant time series variables in regression while controlling the false discovery rate (FDR) at a predetermined level. DeepLINK-T combines deep learning with knockoff inference to control FDR in feature selection for time series models, accommodating a wide variety of feature distributions. It addresses dependencies across time and features by leveraging a time-varying latent factor structure in time series covariates. Three key ingredients for DeepLINK-T are 1) a Long Short-Term Memory (LSTM) autoencoder for generating time series knockoff variables, 2) an LSTM prediction network using both original and knockoff variables, and 3) the application of the knockoffs framework for variable selection with FDR control. Extensive simulation studies have been conducted to evaluate DeepLINK-T's performance, showing its capability to control FDR effectively while demonstrating superior feature selection power for high-dimensional longitudinal time series data compared to its non-time series counterpart. DeepLINK-T is further applied to three metagenomic data sets, validating its practical utility and effectiveness, and underscoring its potential in real-world applications.
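The third ingredient, knockoff-based selection with FDR control, can be illustrated with a minimal NumPy sketch. It assumes importance scores $Z_j$ and $\tilde{Z}_j$ for each original feature and its knockoff are already available (for example, from a fitted prediction network), forms $W_j = Z_j^2 - \tilde{Z}_j^2$, and applies the standard knockoff+ threshold at target level $q$. The function names and the choice of the knockoff+ variant are assumptions made for illustration; the text above does not specify which threshold rule DeepLINK-T uses.

```python
import numpy as np

def knockoff_statistics(z, z_tilde):
    """W_j = Z_j^2 - Z~_j^2; large positive values favor the original feature."""
    z = np.asarray(z, dtype=float)
    z_tilde = np.asarray(z_tilde, dtype=float)
    return z**2 - z_tilde**2

def knockoff_plus_threshold(w, q):
    """Smallest t among the nonzero |W_j| whose estimated FDP is at most q.

    Estimated FDP at t: (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}).
    Returns np.inf when no candidate qualifies (nothing is selected).
    """
    w = np.asarray(w, dtype=float)
    candidates = np.unique(np.abs(w[w != 0]))  # np.unique returns sorted values
    for t in candidates:
        fdp_hat = (1 + np.sum(w <= -t)) / max(1, np.sum(w >= t))
        if fdp_hat <= q:
            return t
    return np.inf

def select_features(w, q):
    """Indices j with W_j at or above the knockoff+ threshold."""
    w = np.asarray(w, dtype=float)
    return np.flatnonzero(w >= knockoff_plus_threshold(w, q))

# Hypothetical importance scores for 8 features and their knockoffs.
z = [3.0, 2.5, 2.0, 1.8, 1.5, 0.2, 0.1, 0.3]
z_tilde = [0.1, 0.2, 0.1, 0.3, 0.2, 0.5, 0.1, 0.1]
w = knockoff_statistics(z, z_tilde)
selected = select_features(w, q=0.25)
```

With these made-up scores, the first five features have clearly positive $W_j$ and are the ones selected at $q = 0.25$. The "+1" in the numerator is what distinguishes knockoff+ from the plain knockoff threshold; it makes the rule more conservative when few features are selected.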
