Revisiting Attention for Multivariate Time Series Forecasting

Haixiang Wu

TL;DR

This work questions whether conventional attention is optimal for multivariate time series forecasting (MTSF) and proposes two alternatives: FSatten, which operates in the frequency domain, and SOatten, which operates in a learnable orthogonal latent space. FSatten uses Fourier-transform embeddings and Multi-head Spectrum Scaling (MSS) to capture frequency correlations between sequences, while SOatten generalizes this to an orthogonal embedding and adds a Head-Coupling Convolution (HCC) to guide pattern learning. Across six real-world datasets, both methods outperform state-of-the-art attention-based models, with notable gains on periodic data and robust performance across architectures. The results suggest that the latent space in which attention is computed can be improved by frequency-domain or orthogonally constrained representations, offering an architecture-agnostic enhancement for MTSF with potential cross-domain applications.

Abstract

Current Transformer methods for Multivariate Time-Series Forecasting (MTSF) are all based on the conventional attention mechanism: they embed the input sequences, apply linear projections to obtain Q, K, and V, and then compute attention within this latent space. Whether such a mapping space is optimal for MTSF has not previously been examined. To investigate this issue, this study first proposes Frequency Spectrum attention (FSatten), a novel attention mechanism based on the frequency-domain space. It employs the Fourier transform for embedding and introduces Multi-head Spectrum Scaling (MSS) to replace the conventional linear mapping of Q and K. FSatten can accurately capture the periodic dependencies between sequences and outperforms conventional attention without changing mainstream architectures. We further design a more general method dubbed Scaled Orthogonal attention (SOatten). It uses an orthogonal embedding and a Head-Coupling Convolution (HCC), based on the neighboring similarity bias, to guide the model in learning comprehensive dependency patterns. Experiments show that FSatten and SOatten surpass the SOTA methods that use conventional attention, making them good alternatives as a basic attention mechanism for MTSF. The code and log files will be released at: https://github.com/Joeland4/FSatten-SOatten.
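To make the FSatten idea in the abstract more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that the Fourier embedding is the amplitude spectrum of each variate, that MSS is a per-head elementwise scaling of that spectrum (with separate scale vectors for Q and K), and that V remains an ordinary linear projection of the time-domain series. The names `fs_attention_sketch`, `scale_q`, `scale_k`, and `w_v` are hypothetical, and random arrays stand in for learned parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fs_attention_sketch(x, scale_q, scale_k, w_v):
    """Frequency-domain attention across variates (illustrative only).

    x:        (N, L)    N variates, lookback length L
    scale_q:  (H, F)    assumed per-head spectrum scaling for Q, F = L // 2 + 1
    scale_k:  (H, F)    assumed per-head spectrum scaling for K
    w_v:      (H, L, D) assumed per-head linear projection for V
    """
    # Embedding: amplitude spectrum of each variate via the real FFT.
    amp = np.abs(np.fft.rfft(x, axis=-1))                 # (N, F)

    # Assumed form of MSS: scale each frequency bin per head
    # instead of applying a dense linear map to obtain Q and K.
    q = amp[None, :, :] * scale_q[:, None, :]             # (H, N, F)
    k = amp[None, :, :] * scale_k[:, None, :]             # (H, N, F)

    # V stays a conventional linear projection of the raw series.
    v = np.einsum("nl,hld->hnd", x, w_v)                  # (H, N, D)

    # Scaled dot-product attention between variates.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (H, N, N)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Toy usage with random stand-ins for learned parameters.
rng = np.random.default_rng(0)
N, L, H, D = 4, 96, 2, 8
F = L // 2 + 1
x = rng.standard_normal((N, L))
out, weights = fs_attention_sketch(
    x,
    scale_q=rng.standard_normal((H, F)),
    scale_k=rng.standard_normal((H, F)),
    w_v=rng.standard_normal((H, L, D)) / np.sqrt(L),
)
```

In this sketch the attention weights have shape `(H, N, N)`, so each head scores every pair of variates by how strongly their scaled spectra overlap. The paper's actual MSS parameterization and output projection may differ.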

Paper Structure (27 sections, 10 equations, 15 figures, 7 tables)

Introduction
Preliminaries
Temporal Transformer
Variate Transformer
FSatten
Workflow
MSS
SOatten
Method
HCC
Experiments
Long-term MTSF
Ablation Studies
Visualized Analysis
Hyperparameter Sensitivity
...and 12 more sections

Figures (15)

Figure 1: Performance of FSatten and SOatten.
Figure 2: (Left) Temporal Transformer and Variate Transformer. (Right) Comparison of the mapping spaces of conventional attention and FSatten.
Figure 3: (Left) Multi-Head Attention. (Right) FSatten. The data shape at each stage is shown to the left of each diagram; during training, the batch size is prepended to each shape.
Figure 4: Multi-head Spectrum Scaling. After the Fast Fourier Transform (FFT), the correlated frequency components within the frequency domain between A and B are determined by scaled amplitude values as indicated by the purple points.
Figure 5: (Left) SOatten. (Right) Scaled Dot-Product Attention with HCC. The data shape at each stage is shown to the left of the SOatten diagram; during training, the batch size is prepended to each shape.
...and 10 more figures
