Attention as Robust Representation for Time Series Forecasting

PeiSong Niu, Tian Zhou, Xue Wang, Liang Sun, Rong Jin

TL;DR

This paper tackles the vulnerability of time series forecasting to noise and distribution shifts by recasting attention as the core data representation. It introduces AttnEmbed, which uses windowed attention with global landmarks and an exponential moving average (EMA) to produce robust embeddings, and augments this with kernel-based attention variants (RBF and polynomial) to better capture similarities. Empirical results on seven real-world datasets show AttnEmbed achieving state-of-the-art or competitive performance, with an average relative MSE reduction of 3.6% over PatchTST and notable gains on noisier datasets; AttnEmbed also functions as a versatile plug-in across architectures. Theoretical and empirical analyses support robustness to noise and rank stability, which matters in practice for building more reliable transformer-based time series forecasting systems. Overall, AttnEmbed provides a modular, general-purpose enhancement for time series embeddings that improves forecasting accuracy while maintaining compatibility with existing models.
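To make the idea of "attention weights as the embedding" concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The function names (`kernel_weights`, `attention_embedding`), the random query/key projections, the use of chunk means as global landmarks, and all default hyperparameters are assumptions made for illustration. The EMA step is omitted because this summary does not say how it is applied.

```python
import numpy as np

def kernel_weights(q, k, kind="softmax", gamma=1.0, degree=2, c=1.0):
    """Row-normalized similarity between rows of q (n, d) and rows of k (m, d)."""
    if kind == "softmax":
        s = q @ k.T / np.sqrt(q.shape[-1])
        vals = np.exp(s - s.max(axis=-1, keepdims=True))  # stable exponent
    elif kind == "rbf":
        sq_dist = ((q[:, None, :] - k[None, :, :]) ** 2).sum(axis=-1)
        # Subtracting the row minimum avoids underflow; normalization cancels it.
        vals = np.exp(-gamma * (sq_dist - sq_dist.min(axis=-1, keepdims=True)))
    elif kind == "poly":
        vals = (q @ k.T + c) ** degree  # an even degree keeps values non-negative
    else:
        raise ValueError(f"unknown kernel: {kind}")
    return vals / (vals.sum(axis=-1, keepdims=True) + 1e-12)

def attention_embedding(x, window=8, stride=8, n_landmarks=4, d=16,
                        kind="softmax", seed=0):
    """Embed a 1-D series as flattened per-window attention maps.

    Assumption: each time step is a scalar token projected by random
    (untrained) matrices; landmarks are means of equal chunks of the series.
    """
    rng = np.random.default_rng(seed)
    w_q = rng.normal(size=(1, d))  # hypothetical query projection
    w_k = rng.normal(size=(1, d))  # hypothetical key projection
    tokens = x[:, None]            # (L, 1)

    # Global landmarks: a coarse summary of the whole series, shared by all windows.
    landmarks = np.array([chunk.mean() for chunk in np.array_split(x, n_landmarks)])
    k_global = landmarks[:, None] @ w_k           # (n_landmarks, d)

    embeddings = []
    for start in range(0, len(x) - window + 1, stride):
        local = tokens[start:start + window]      # (window, 1)
        q = local @ w_q                           # (window, d)
        k = np.concatenate([local @ w_k, k_global], axis=0)
        attn = kernel_weights(q, k, kind=kind)    # (window, window + n_landmarks)
        embeddings.append(attn.ravel())           # the attention map is the embedding
    return np.stack(embeddings)

# Usage on a noisy sine wave (synthetic data, for illustration only).
rng = np.random.default_rng(1)
series = np.sin(np.linspace(0, 4 * np.pi, 96)) + 0.1 * rng.normal(size=96)
emb = attention_embedding(series, kind="rbf")
print(emb.shape)  # one row per window, window * (window + n_landmarks) columns
```

Each row of `emb` could then stand in for the patch embedding that a PatchTST-style encoder consumes. Because every attention row is normalized, the embedding depends on relative similarities between time steps rather than on raw magnitudes, which is one intuition for the robustness described above.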

Abstract

Time series forecasting is essential for many practical applications, and transformer-based models are increasingly adopted for it because of their impressive performance in NLP and CV. The transformer's key feature, the attention mechanism, dynamically fuses embeddings to enhance data representation, which often relegates attention weights to a byproduct role. Yet time series data, characterized by noise and non-stationarity, poses significant forecasting challenges. Our approach elevates attention weights to the primary representation for time series, capitalizing on the temporal relationships among data points to improve forecasting accuracy. Our study shows that an attention map, structured using global landmarks and local windows, acts as a robust kernel representation for data points, withstanding noise and shifts in distribution. Our method outperforms state-of-the-art models, reducing mean squared error (MSE) in multivariate time series forecasting by a notable 3.6% without altering the core neural network architecture. It serves as a versatile component that can readily replace recent patching-based embedding schemes in transformer-based models, boosting their performance.

Paper Structure

This paper contains 37 sections, 40 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: Comparison between AttnEmbed (ours) and PatchTST on synthetic data. (a) Non-stationary. (b) Noise reduction.
  • Figure 2: The architecture of (a) AttnEmbed and a comparison with (b) PatchTST. Unlike PatchTST, AttnEmbed considers the relationship of time steps within each window.
  • Figure 3: Detail of attention embedding.
  • Figure 4: Parameter analysis on ETTh1 with a lookback window of 96 and a horizon of 96. (a) Window size. (b) Decrease coefficients of EMA. (c) Stride sizes of global Conv1D. (d) Layer numbers of the attention embedding module.
  • Figure 5: Relative norm of the residual along the depth for PatchTST, Attention, RBF kernel, and polynomial kernel, with different numbers of transformer encoder layers (3 and 6), on ETTh1.