Timer-XL: Long-Context Transformers for Unified Time Series Forecasting

Yong Liu, Guo Qin, Xiangdong Huang, Jianmin Wang, Mingsheng Long

TL;DR

Timer-XL addresses the context bottleneck in time series forecasting by introducing a decoder-only Transformer that treats forecasting as multivariate next-token prediction. It introduces TimeAttention with a Kronecker-based masking scheme and RoPE temporal embeddings to capture fine-grained intra- and inter-series dependencies, enabling long-context forecasting with thousands of patch tokens ($NT$). The approach yields state-of-the-art results across univariate, multivariate, and covariate-informed benchmarks, and demonstrates strong zero-shot performance after large-scale pre-training, highlighting its potential as a foundation model for time series. The work provides practical mechanisms for incorporating covariates and exogenous variables while maintaining causality, offering scalable, one-for-all forecasting capabilities with broad applicability in real-world domains.

Abstract

We present Timer-XL, a causal Transformer for unified time series forecasting. To uniformly predict multidimensional time series, we generalize next token prediction, predominantly adopted for 1D token sequences, to multivariate next token prediction. The paradigm formulates various forecasting tasks as a long-context prediction problem. We opt for decoder-only Transformers that capture causal dependencies from varying-length contexts for unified forecasting, making predictions on non-stationary univariate time series, multivariate series with complicated dynamics and correlations, as well as covariate-informed contexts that include exogenous variables. Technically, we propose a universal TimeAttention to capture fine-grained intra- and inter-series dependencies of flattened time series tokens (patches), which is further enhanced by deft position embedding for temporal causality and variable equivalence. Timer-XL achieves state-of-the-art performance across task-specific forecasting benchmarks through a unified approach. Based on large-scale pre-training, Timer-XL achieves state-of-the-art zero-shot performance, making it a promising architecture for pre-trained time series models. Code is available at this repository: https://github.com/thuml/Timer-XL.
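As a rough illustration of the "flattened time series tokens (patches)" mentioned in the abstract, the sketch below splits each variable into patches and concatenates them into a single sequence of $NT$ tokens. The non-overlapping patching, the patch length, the variable-by-variable ordering, and the helper names `patchify` and `flatten_tokens` are assumptions made for this example; they are not taken from the Timer-XL code base.

```python
from typing import List, Tuple

Patch = List[float]


def patchify(series: List[float], patch_len: int) -> List[Patch]:
    """Split one series into non-overlapping patches, dropping any trailing remainder."""
    n_patches = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n_patches)]


def flatten_tokens(variables: List[List[float]], patch_len: int) -> List[Tuple[int, int, Patch]]:
    """Flatten N variables into one sequence of (variable index, patch index, patch).

    Assumed ordering: all patches of variable 0, then all patches of variable 1, and so on.
    """
    tokens = []
    for var_idx, series in enumerate(variables):
        for t_idx, patch in enumerate(patchify(series, patch_len)):
            tokens.append((var_idx, t_idx, patch))
    return tokens


# Toy input: N = 3 variables with 8 time points each; patch length 4 gives T = 2.
variables = [[float(v * 10 + t) for t in range(8)] for v in range(3)]
tokens = flatten_tokens(variables, patch_len=4)
print(len(tokens))  # number of tokens in the flattened context, N * T
```

Under these assumptions the context length grows as the product of the number of variables and the number of patches per variable, which is why multivariate contexts quickly reach thousands of tokens.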

Paper Structure (63 sections, 15 equations, 12 figures, 16 tables)

Figures (12)

  • Figure 1: We compare the context length (measured by token number) of Transformers in different modalities and propose Timer-XL that increases the length to thousands of patch tokens. Given the generality across contexts, Timer-XL is a versatile solution for various forecasting tasks.
  • Figure 2: Illustration of TimeAttention. For univariate series, the temporal mask $\mathcal{T}$ preserves causality. Given multivariate patch tokens sorted in a temporal-first order, we adopt the variable dependencies $\mathcal{C}$, an all-one matrix, as the left operand of the Kronecker product, expanding the temporal mask to a block matrix that exactly reflects the dependencies of multivariate next token prediction. The formulation also generalizes to univariate and covariate-informed contexts with pre-defined variable dependencies.
  • Figure 3: Univariate forecasting (pred-$96$) of well-acknowledged benchmarks under channel independence (Nie et al., 2022). We increase the lookback length to encompass monthly and yearly contexts.
  • Figure 4: Multivariate forecasting of GTWSF ($2$-day-pred-$1$-day), involving $3850$ worldwide stations spanning two years. Results of the baseline models are those officially reported by Ding et al. (2024).
  • Figure 5: Illustration of one-for-all generalization (left). Owing to its contextual flexibility, Timer-XL can predict heterogeneous time series, which indicates three directions of generalization. We compare performance when generalizing across time and variables (middle), and zero-shot results across datasets (right), emphasizing the benefit of long-context pre-training.
  • ...and 7 more figures
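
The masking scheme described in the Figure 2 caption can be sketched in a few lines. The example builds a lower-triangular temporal mask $\mathcal{T}$, takes an all-one variable-dependency matrix $\mathcal{C}$ as the left operand, and forms the block mask $\mathcal{C} \otimes \mathcal{T}$. It is a minimal sketch rather than the authors' implementation: the convention that `1` means "may attend", the token index `variable * T + time` implied by this operand order, and the helper names `causal_mask` and `kronecker` are assumptions of the example. The identity matrix used for comparison is likewise only an illustrative choice of pre-defined variable dependency.

```python
from typing import List

Matrix = List[List[int]]


def causal_mask(num_patches: int) -> Matrix:
    """Lower-triangular temporal mask: a token at time i may attend to times j <= i."""
    return [[1 if j <= i else 0 for j in range(num_patches)] for i in range(num_patches)]


def kronecker(a: Matrix, b: Matrix) -> Matrix:
    """Kronecker product of two dense matrices given as nested lists."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
         for j in range(len(a[0]) * cols_b)]
        for i in range(len(a) * rows_b)
    ]


N, T = 2, 3  # toy sizes: 2 variables, 3 patch tokens per variable

# All-one variable dependencies: every variable may attend to every variable.
all_ones = [[1] * N for _ in range(N)]
mask = kronecker(all_ones, causal_mask(T))  # (N*T) x (N*T) block matrix

# Illustrative alternative: identity dependencies keep each variable to itself.
identity = [[1 if i == j else 0 for j in range(N)] for i in range(N)]
independent_mask = kronecker(identity, causal_mask(T))

for row in mask:
    print(row)
```

With the all-one $\mathcal{C}$, every block of the result is a copy of the temporal mask, so each token can attend to tokens of any variable at the same or earlier patch positions, and never to later ones.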