Leveraging 2D Information for Long-term Time Series Forecasting with Vanilla Transformers

Xin Cheng; Xiuying Chen; Shuqi Li; Di Luo; Xun Wang; Dongyan Zhao; Rui Yan

TL;DR

GridTST tackles long-term multivariate time-series forecasting by arranging the data as a time-by-variate grid and applying vanilla Transformer attention in two directions. Horizontal attention over patched time tokens captures temporal dynamics, vertical attention over variate tokens captures cross-variable relationships, and the patching strategy retains local semantics. Empirical results across seven real-world datasets show state-of-the-art performance and robustness to varying lookback lengths, with ablations and visualizations supporting the dual-attention design. The approach shows that vanilla Transformer architectures are practical for complex time-series tasks and suggests extending grid-based strategies to multi-modal data.

Abstract

Time series prediction is crucial for understanding and forecasting complex dynamics in domains ranging from finance and economics to climate and healthcare. Among Transformer-based methods, one approach encodes multiple variables from the same timestamp into a single temporal token to model global dependencies. Another approach embeds the time points of each individual series into a separate variate token. The former struggles to learn variate-centric representations, while the latter risks missing temporal information that is critical for accurate forecasting. In our work, we introduce GridTST, a model that combines the benefits of both approaches using multi-directional attention built on a vanilla Transformer. We regard the input time series data as a grid, where the $x$-axis represents the time steps and the $y$-axis represents the variates. A vertical slicing of this grid combines the variates at each time step into a time token, while a horizontal slicing embeds the individual series across all time steps into a variate token. Correspondingly, a horizontal attention mechanism operates on time tokens to capture the correlations between data at different time steps, while a vertical, variate-aware attention captures multivariate correlations. This combination enables efficient processing of information across both the time and variate dimensions. We also integrate the patch technique, segmenting time tokens into subseries-level patches so that local semantic information is retained in the embedding. The GridTST model consistently delivers state-of-the-art performance across various real-world datasets.
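To make the grid view above concrete, here is a minimal, dependency-free sketch of attention applied along the two axes of a patch grid. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`patchify`, `self_attention`, `horizontal_attention`, `vertical_attention`), the patch length and stride, the identity query/key/value projections, and the horizontal-then-vertical order are all hypothetical choices made for brevity.

```python
import math

def patchify(series, patch_len, stride):
    """Split one variate's series into (possibly overlapping) patches."""
    return [series[s:s + patch_len]
            for s in range(0, len(series) - patch_len + 1, stride)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: identity Q/K/V projections (no learned weights), so each
    output token is a convex combination of the input tokens.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def horizontal_attention(grid):
    """Attend across time patches within each variate (each grid row)."""
    return [self_attention(row) for row in grid]

def vertical_attention(grid):
    """Attend across variates at each patch position (each grid column)."""
    columns = [list(col) for col in zip(*grid)]
    attended = [self_attention(col) for col in columns]
    return [list(row) for row in zip(*attended)]

# Toy input: 3 variates, 8 time steps each.
data = [
    [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
    [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3],
    [0.5, 0.4, 0.5, 0.6, 0.5, 0.4, 0.5, 0.6],
]

# Grid of tokens: rows are variates, columns are patch positions.
grid = [patchify(v, patch_len=4, stride=2) for v in data]

# Assumed order: horizontal (time) attention first, then vertical (variate).
out = vertical_attention(horizontal_attention(grid))
```

In a real model each patch would first pass through a learned embedding, and each attention call would be a full Transformer layer; the sketch only shows which tokens attend to which along each axis.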

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, and 10 tables.

Introduction
Related Work
GridTST
Problem Formulation
Model Structure
Experiments
Prediction Performance
Visualization Case Study
Ablation study
Scalability of GridTST
Efficient training strategy
Conclusion and Broader Impacts
Dataset
Comparison with existing baselines
Increasing lookback length
...and 1 more section

Figures (5)

Figure 1: Comparison of the vanilla Transformer (a), inverse Transformer (b), and our proposed GridTST (c). Unlike baseline transformers that embed time steps into temporal tokens or the entire series into variate tokens separately, GridTST models both simultaneously. This approach captures multivariate and multi-time-step correlations using a bi-directional attention mechanism.
Figure 2: Overview of our proposed GridTST. Firstly, we transform the inputs by breaking them down into grids. These grids undergo processing via vanilla transformer attention, incorporating distinct horizontal and vertical directions. Finally, our model generates projected prediction results.
Figure 3: Forecasting Performance with Lookback Length $T \in \{96, 172, 336, 512, 720\}$ and Fixed Prediction Length $F = 96$. The performance of time-centric PatchTST or variate-centric iTransformer forecasters does not markedly improve with increased lookback length. In contrast, our GridTST framework enhances the vanilla Transformer, yielding improved performance when utilizing an enlarged lookback window.
Figure 4: Visualization of attention maps and time series forecasts from the Traffic dataset. For each time series, the input data is represented in blue, the GridTST model's predictions in orange, and the actual observed values in green. The three black demarcation lines indicate the three segmented patches. The synergy of horizontal and vertical attention mechanisms enables the model to refine its forecasts by concentrating on the spatial and temporal information deemed most pertinent.
Figure 5: Analysis of the Proposed Training Strategy. Performance (Left) remains stable when only a sampled portion of the variates in each batch is trained, with the sampled ratio ranging from 20% to 100%. At the same time, both the memory footprint (Middle) and the latency (Right) of the training process drop notably.
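
The Figure 5 caption does not spell out the training strategy, so the following is only a hedged sketch of one plausible reading: at each training step, a random subset of the variates in a batch is kept, with the sampled ratio controlling the subset size. The function name `sample_variates` and its parameters are hypothetical and do not come from the paper.

```python
import random

def sample_variates(batch, ratio, rng):
    """Keep a random fraction of the variates in one batch.

    batch: list of per-variate series (one list per variate).
    ratio: fraction of variates to keep, e.g. 0.2 for 20%.
    rng:   a random.Random instance, passed in for reproducibility.
    Returns the kept indices (sorted) and the corresponding sub-batch.
    Assumption: at least one variate is always kept.
    """
    n = len(batch)
    k = max(1, round(n * ratio))
    idx = sorted(rng.sample(range(n), k))
    return idx, [batch[i] for i in idx]

# Toy batch: 10 variates, 4 time steps each.
batch = [[float(v)] * 4 for v in range(10)]
rng = random.Random(0)

idx_20, sub_20 = sample_variates(batch, 0.2, rng)    # 2 of 10 variates
idx_100, sub_100 = sample_variates(batch, 1.0, rng)  # all 10 variates
```

Under this reading, training on fewer variates per step reduces the number of tokens processed, which would be consistent with the lower memory footprint and latency reported in the caption.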
