Are Self-Attentions Effective for Time Series Forecasting?

Dongbin Kim, Jinseong Park, Jaewook Lee, Hoki Kim

TL;DR

This paper investigates whether self-attention is essential for time-series forecasting and introduces CATS, a cross-attention-only transformer that uses horizon-dependent queries, parameter sharing, and query-adaptive masking. Across seven real-world datasets, CATS achieves state-of-the-art or competitive forecasting accuracy while using fewer parameters and less memory than contemporary transformer-based models. The study provides not only strong empirical results for long- and short-term forecasting but also interpretable attention maps that reveal horizon-specific periodic patterns. By challenging the central role of self-attention in time-series modeling, the work offers a more efficient architectural paradigm with practical implications for scalable real-world forecasting.

Abstract

Time series forecasting is crucial for applications across multiple domains and various scenarios. Although Transformer models have dramatically advanced the landscape of forecasting, their effectiveness remains debated. Recent findings have indicated that simpler linear models might outperform complex Transformer-based approaches, highlighting the potential for more streamlined architectures. In this paper, we shift the focus from evaluating the overall Transformer architecture to specifically examining the effectiveness of self-attention for time series forecasting. To this end, we introduce a new architecture, Cross-Attention-only Time Series transformer (CATS), that rethinks the traditional Transformer framework by eliminating self-attention and leveraging cross-attention mechanisms instead. By establishing future horizon-dependent parameters as queries and enhancing parameter sharing, our model not only improves long-term forecasting accuracy but also reduces the number of parameters and memory usage. Extensive experiments across various datasets demonstrate that our model achieves superior performance with the lowest mean squared error and uses fewer parameters compared to existing models. The implementation of our model is available at: https://github.com/dongbeank/CATS.
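
The core idea in the abstract, horizon-dependent parameters acting as queries over the embedded past, can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it assumes single-head scaled dot-product attention, random stand-ins for the learned embeddings and queries, and a hypothetical `shared_head` projection reused by every horizon to mimic parameter sharing. Query-adaptive masking and learned key/value projections are omitted.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: H vectors of size d (one per forecast horizon)
    keys, values: N vectors of size d (one per past input patch)
    Returns (outputs, weights): H output vectors and an H x N weight matrix.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out = [sum(w[n] * values[n][j] for n in range(len(values)))
               for j in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


random.seed(0)
d, num_patches, num_horizons = 4, 6, 3  # illustrative sizes only

# Assumption: random vectors stand in for embedded past patches.
past = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_patches)]

# Assumption: horizon-dependent queries are free parameters,
# one vector per forecast horizon (randomly initialised here, not trained).
horizon_queries = [[random.gauss(0, 1) for _ in range(d)]
                   for _ in range(num_horizons)]

# Hypothetical shared projection: the same vector maps every horizon's
# attention output to a scalar forecast, so its size does not grow
# with the number of horizons.
shared_head = [random.gauss(0, 1) for _ in range(d)]

# No self-attention anywhere: the past is only ever attended to by queries.
outputs, weights = cross_attention(horizon_queries, past, past)
forecast = [sum(h * o for h, o in zip(shared_head, out)) for out in outputs]
```

Each row of `weights` shows how one horizon distributes attention over the past patches, which is the kind of per-horizon attention map the TL;DR describes as interpretable.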

Paper Structure

This paper contains 29 sections, 2 equations, 17 figures, and 18 tables.

Figures (17)

  • Figure 1: Experimental results illustrating the mean squared error (MSE) and the number of parameters with varying input sequence lengths on ETTm1. Each bubble represents a different model, with the bubble size indicating the number of parameters in millions; larger bubbles denote models with more parameters. Our model consistently shows the lowest MSE (i.e., best performance) with fewer parameters even for longer input sequences. The detailed results can be found in Table \ref{table:ettm1_param}.
  • Figure 1: Effect of self-attention in PatchTST on forecasting performance (MSE) on ETTm1.
  • Figure 2: Absolute values of weights in the final linear layer for different PatchTST variations. The distinct patterns reveal how each model captures temporal information.
  • Figure 3: Illustration of existing time series forecasting architectures and the proposed architecture.
  • Figure 3: Effect of parameter sharing across horizons on the number of parameters for different forecasting horizons on ETTh1.
  • ...and 12 more figures