TimelyGPT: Extrapolatable Transformer Pre-training for Long-term Time-Series Forecasting in Healthcare

Ziyang Song; Qincheng Lu; Hao Xu; He Zhu; David L. Buckeridge; Yue Li

TL;DR

TimelyGPT tackles the challenge of long-term forecasting in healthcare time-series by extending Transformer-based pre-training with an extrapolatable position embedding (xPos), Retention-based global attention, and local temporal convolutions. The model trains with cost linear in sequence length and runs inference in constant time per step, while extrapolating beyond the training horizon, which addresses the limitations of conventional self-attention on long sequences. It is pre-trained on large-scale unlabeled biosignal and EHR-like data and fine-tuned for downstream tasks, extrapolating accurately up to 6,000 timesteps and achieving high recall when predicting diagnoses from irregularly-sampled records. The approach offers a scalable, transferable framework for long-term patient health state forecasting and risk trajectory modeling in healthcare domains. Potential impact includes improved long-range monitoring and earlier intervention through robust, data-efficient pre-training on diverse healthcare time-series.

Abstract

Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success in Natural Language Processing and Computer Vision domains. However, the development of PTMs on healthcare time-series data is lagging behind. This underscores the limitations of the existing transformer-based architectures, particularly their scalability to handle large-scale time series and ability to capture long-term temporal dependencies. In this study, we present Timely Generative Pre-trained Transformer (TimelyGPT). TimelyGPT employs an extrapolatable position (xPos) embedding to encode trend and periodic patterns into time-series representations. It also integrates recurrent attention and temporal convolution modules to effectively capture global-local temporal dependencies. We evaluated TimelyGPT on two large-scale healthcare time series datasets corresponding to continuous biosignals and irregularly-sampled time series, respectively. Our experiments show that during pre-training, TimelyGPT excels in learning time-series representations from continuously monitored biosignals and irregularly-sampled time series data commonly observed in longitudinal electronic health records (EHRs). In forecasting continuous biosignals, TimelyGPT achieves accurate extrapolation up to 6,000 timesteps of body temperature during the sleep stage transition, given a short look-up window (i.e., prompt) containing only 2,000 timesteps. For irregularly-sampled time series, TimelyGPT with a proposed time-specific inference demonstrates high top recall scores in predicting future diagnoses using early diagnostic records, effectively handling irregular intervals between clinical records. Together, we envision TimelyGPT to be useful in a broad spectrum of health domains, including long-term patient health state forecasting and patient risk trajectory prediction.
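
The recurrent attention mentioned in the abstract corresponds to the Retention module named in the TL;DR. As a rough illustration of why such a module can be trained in parallel yet evaluated one step at a time with a fixed-size state, the sketch below computes a single-head retention output in two ways and checks that they agree. It follows the generic retention formulation with a scalar decay `gamma`; the function names, the toy dimensions, and the omission of the xPos rotation, normalization, and multiple heads are simplifying assumptions, not the paper's implementation.

```python
import random


def retention_parallel(q, k, v, gamma):
    """Pairwise form: o_n = sum over m <= n of gamma^(n-m) * (q_n . k_m) * v_m."""
    out = []
    for n in range(len(q)):
        o = [0.0] * len(v[0])
        for m in range(n + 1):
            weight = gamma ** (n - m) * sum(a * b for a, b in zip(q[n], k[m]))
            o = [oi + weight * vi for oi, vi in zip(o, v[m])]
        out.append(o)
    return out


def retention_recurrent(q, k, v, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, then o_n = q_n S_n.

    The state S has a fixed shape (d_k x d_v), so each step costs the same
    regardless of how many timesteps came before it.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for qn, kn, vn in zip(q, k, v):
        state = [
            [gamma * state[i][j] + kn[i] * vn[j] for j in range(d_v)]
            for i in range(d_k)
        ]
        out.append([sum(qn[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


def random_matrix(rows, cols, rng):
    """Toy inputs; real queries, keys, and values would come from learned projections."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
q = random_matrix(6, 3, rng)
k = random_matrix(6, 3, rng)
v = random_matrix(6, 2, rng)

parallel_out = retention_parallel(q, k, v, gamma=0.9)
recurrent_out = retention_recurrent(q, k, v, gamma=0.9)
max_diff = max(
    abs(a - b)
    for row_p, row_r in zip(parallel_out, recurrent_out)
    for a, b in zip(row_p, row_r)
)
print("largest difference between the two forms:", max_diff)
```

The pairwise form is the one that parallelizes over timesteps during training, while the recurrent form only needs the running state, which is what makes step-by-step forecasting over long horizons cheap.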

Paper Structure (32 sections, 19 equations, 10 figures, 5 tables)

Introduction
Related work
Self-attention in Transformer
Position embedding in Transformer
TimelyGPT Methodology
Extrapolatable position embedding encodes temporal patterns
Retention for continuous and irregularly-sampled time series
Convolution modules for local interaction
Computational complexity
Data
Sleep-EDF dataset
PopHR database
Experiments
Pre-training and fine-tuning
Jointly forecasting multivariate biosignals from Sleep-EDF dataset
...and 17 more sections

Figures (10)

Figure 1: TimelyGPT overview. a. TimelyGPT architecture. TimelyGPT consists of a convolution-subsampling tokenizer followed by $L$ decoder layers, with detailed overflow provided in Appendix \ref{sec: overflow}. b. Generative decoder with xPos embedding. Each decoder layer is coupled with extrapolatable position embedding (Section \ref{sec:xPos}) that encodes trend and periodic patterns into representations, facilitating forecasting with extrapolation ability. c. Chunk-wise Retention. This module consists of parallel intra-chunk Retention and recurrent inter-chunk Retention, effectively handling long sequences in continuously monitored biosignals (Appendix \ref{equivalence}). d. Temporal Convolution (Section \ref{sec: convolution}) captures nuanced local interactions from time-series representations.
Figure 2: Two inference strategies for forecasting irregularly-sampled time series. (a) Trajectory-based inference. TimelyGPT autoregressively predicts the entire sequence at equal time intervals. The target intervals can then be taken from part of the inferred trajectory. (b) Time-specific inference. TimelyGPT directly predicts the target data point using historical hidden states and the gap between the target timestep and the last observed timestep.
Figure 3: Test MAE of forecasting Sleep-EDF biosignals as a function of dataset sizes and parameter sizes. Both look-up and forecasting windows were set to 256 timesteps. TimelyGPT with more parameters tends to exhibit better performance when trained on larger datasets.
Figure 4: Sleep-EDF biosignal forecasting performance of TimelyGPT and seven state-of-the-art methods over various forecasting windows. a. MAE for 8 methods evaluated over 3 forecasting windows (720, 2000, and 6000 timesteps). b. Cross-correlation scores for the same methods and forecasting windows. The detailed numerical results are summarized in Table \ref{tab:forecasting_comp}.
Figure 5: Predicted sequence of Sleep-EDF biosignals over 6,000 timesteps. Given a 2,000-timestep look-up window, we applied TimelyGPT (blue solid line) and 4 state-of-the-art methods (dashed lines) to predict the biosignals for the next 6,000 timesteps. The ground-truth biosignals are displayed as a red solid line. The two vertical lines demarcate the look-up window and the length of pre-training sequences, respectively.
...and 5 more figures
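
The two inference strategies in Figure 2 differ in how the gap between the last observed record and the target timestep is consumed. The toy sketch below illustrates that difference with a scalar state. It assumes, purely for illustration, that the hidden state decays geometrically with elapsed time (`gamma ** gap`); the captions do not state this, and all function names, the integer timestamps, and the contribution values are hypothetical. The real trajectory-based strategy also feeds the model's own predictions back in at every step, which is omitted here, so the sketch only shows how an irregular gap can enter the state in one jump rather than through many unit steps.

```python
def decay_and_update(state, gamma, gap, contribution=0.0):
    """Decay the state across `gap` time units, then add a new record's contribution."""
    return (gamma ** gap) * state + contribution


def encode_visits(times, contributions, gamma):
    """Fold irregularly spaced records into one state, using the actual gaps between them."""
    state, last_t = 0.0, times[0]
    for t, c in zip(times, contributions):
        state = decay_and_update(state, gamma, t - last_t, c)
        last_t = t
    return state, last_t


def time_specific_state(state, last_t, target_t, gamma):
    """Jump directly to the target time using the gap from the last observed record."""
    return decay_and_update(state, gamma, target_t - last_t)


def unit_step_rollout(state, last_t, target_t, gamma):
    """Walk to the target one time unit at a time (equal-interval rollout, no feedback)."""
    for _ in range(target_t - last_t):
        state = decay_and_update(state, gamma, 1)
    return state


# Hypothetical patient history: four records at irregular integer times.
visit_times = [0, 3, 4, 10]
visit_contributions = [1.0, 0.5, 2.0, 1.5]
gamma = 0.9

state, last_t = encode_visits(visit_times, visit_contributions, gamma)
jumped = time_specific_state(state, last_t, target_t=15, gamma=gamma)
rolled = unit_step_rollout(state, last_t, target_t=15, gamma=gamma)
print("direct jump:", jumped, "unit-step rollout:", rolled)
```

Under this decay assumption the direct jump reaches the same state in a single operation that the unit-step walk reaches in five, which is the intuition behind predicting a target point from the historical state plus the time gap.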
