SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention

Romain Ilbert; Ambroise Odonnat; Vasilii Feofanov; Aladin Virmaux; Giuseppe Paolo; Themis Palpanas; Ievgen Redko

TL;DR

This paper examines why transformers underperform in multivariate long-horizon time series forecasting and identifies the poor trainability of the attention mechanism as a key culprit. It proposes SAMformer, a shallow transformer that combines channel-wise attention, RevIN normalization, and sharpness-aware minimization (SAM) to achieve stable learning that generalizes. Across eight real-world datasets, SAMformer surpasses state-of-the-art baselines while using far fewer parameters, and it shows smoother loss landscapes and robustness to initialization. The work demonstrates that careful training strategies can unlock the potential of simple transformer architectures for efficient multivariate forecasting.

Abstract

Transformer-based architectures achieved breakthrough performance in natural language processing and computer vision, yet they remain inferior to simpler linear baselines in multivariate long-term forecasting. To better understand this phenomenon, we start by studying a toy linear forecasting problem for which we show that transformers are incapable of converging to their true solution despite their high expressive power. We further identify the attention of transformers as being responsible for this low generalization capacity. Building upon this insight, we propose a shallow lightweight transformer model that successfully escapes bad local minima when optimized with sharpness-aware optimization. We empirically demonstrate that this result extends to all commonly used real-world multivariate time series datasets. In particular, SAMformer surpasses current state-of-the-art methods and is on par with the biggest foundation model MOIRAI while having significantly fewer parameters. The code is available at https://github.com/romilbert/samformer.
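As a rough illustration of the sharpness-aware optimization mentioned in the abstract, the sketch below applies the generic SAM update to a toy linear regression in plain Python. It is not taken from the SAMformer repository: the data, the function names `loss_and_grad` and `sam_step`, and the values of `lr` and `rho` are all assumptions chosen for illustration. Each step first moves the weights a distance `rho` along the normalized gradient (toward the locally worst-case neighbor), then applies the gradient computed there to the original weights.

```python
import math

def loss_and_grad(w, xs, ys):
    """Mean squared error of the linear model y ~ w.x, and its gradient."""
    n = len(xs)
    grad = [0.0] * len(w)
    loss = 0.0
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        loss += err * err / n
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / n
    return loss, grad

def sam_step(w, xs, ys, lr=0.1, rho=0.05):
    """One SAM update (lr and rho are illustrative values, not the paper's)."""
    _, g = loss_and_grad(w, xs, ys)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    # Ascent step: perturb the weights by rho along the normalized gradient.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point to w.
    _, g_adv = loss_and_grad(w_adv, xs, ys)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Toy data generated by y = 2*x0 - 1*x1 (an assumption for this sketch).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
ys = [2.0 * a - 1.0 * b for a, b in xs]

w = [0.0, 0.0]
initial_loss, _ = loss_and_grad(w, xs, ys)
for _ in range(200):
    w = sam_step(w, xs, ys)
final_loss, _ = loss_and_grad(w, xs, ys)
```

With a fixed `rho`, the iterates in this toy problem settle in a small neighborhood of the true weights `[2, -1]` rather than exactly on them, because the ascent step always has length `rho`.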

Paper Structure (61 sections, 43 equations, 15 figures, 10 tables, 1 algorithm)

Introduction
Limitation of current approaches.
Trainability of transformers.
Summary of our contributions.
Proposed Approach
Notations.
Problem Setup
Motivational Example
Transformer's Loss Landscape
Intuition.
Existing solutions.
SAMformer: Putting It All Together
Experiments
Datasets.
Baselines.
...and 46 more sections

Figures (15)

Figure 1: Illustration of our approach on synthetic data. Oracle is the optimal solution, Transformer is a base transformer, $\sigma$Reparam is a Transformer with weight rescaling (Zhai et al., 2023), and Transformer+SAM is a Transformer trained with sharpness-aware minimization. Transformer overfits, $\sigma$Reparam improves slightly but fails to reach Oracle, while Transformer+SAM generalizes perfectly. This motivates SAMformer, a shallow transformer combining SAM and best practices in time series forecasting.
Figure 2: Poor generalization. Despite its simplicity, Transformer suffers from severe overfitting. Fixing the attention weights in Random Transformer improves the generalization, hinting at the role of attention in preventing convergence to optimal local minima.
Figure 3: Transformer's loss landscape analysis for linear regression. (a) The attention matrices of Transformer get stuck at identity from the first epoch. (b, left) Transformer converges to a sharper minimum than Transformer+SAM, with a much larger $\lambda_\mathrm{max}$ ($\sim \times 10^4$), while Random Transformer has a smooth loss landscape. (b, right) Transformer suffers from entropy collapse during training, confirming the high sharpness of its loss landscape.
Figure 4: SAMformer
Figure 5: (a) SAMformer has a smoother loss landscape than Transformer. (b) SAMformer consistently generalizes well for every initialization, while Transformer is unstable and heavily depends on the seed.
...and 10 more figures
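
The caption of Figure 3 links attention matrices stuck at identity with entropy collapse. The sketch below shows one common way to quantify this, assuming attention entropy is measured as the mean Shannon entropy of the row-wise softmax of the score matrix. The function names and the toy score matrices are hypothetical and are not taken from the paper's code. Uniform attention over $n$ positions gives the maximum entropy $\log n$, while near-identity attention gives an entropy close to zero.

```python
import math

def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(scores):
    """Mean Shannon entropy (in nats) of the row-wise softmax of `scores`."""
    entropies = []
    for row in scores:
        probs = softmax(row)
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
    return sum(entropies) / len(entropies)

n = 4
# Equal scores: every row attends uniformly over all n positions.
uniform_scores = [[0.0] * n for _ in range(n)]
# A dominant diagonal: each row attends almost only to itself (near identity).
peaked_scores = [[50.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

uniform_entropy = attention_entropy(uniform_scores)
peaked_entropy = attention_entropy(peaked_scores)
```

Here `uniform_entropy` equals $\log 4$ up to floating-point error, and `peaked_entropy` is close to zero, which is the regime the caption describes as entropy collapse.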

Theorems & Definitions (11)

Remark 4.1: Interpretation of $\rho$
proof
proof
proof
proof
proof
proof
proof
proof
proof
...and 1 more
