Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond

Yekun Ke; Yingyu Liang; Zhenmei Shi; Zhao Song; Chiwun Yang

TL;DR

The paper addresses why Transformer models often fail to generalize in time series forecasting by presenting the first theoretical explanation based on asymmetric learning in attention training. Using a two-layer attention model analyzed through the Neural Tangent Kernel (NTK) framework, it shows that parameter updates tend to align along the direction of the value weights, hindering the learning of residual (core) features when sign relations between successive steps are inconsistent. It demonstrates that linear residual networks can generalize to out-of-distribution (OOD) data under sign-inconsistent next-step-prediction tasks, while attention cannot, and it provides generalizations and potential remedies such as Differential Transformer, RoPE, and patching. Collectively, these results offer a principled foundation for designing more expressive and efficient transformer-based architectures for time series forecasting and related sequential tasks.

Abstract

Applying transformer-based models to time series forecasting (TSF) has long been a popular research topic. However, many of these models fail to beat a simple linear residual model, and the theoretical understanding of this issue is still limited. In this work, we propose the first theoretical explanation of the inefficiency of transformers on TSF tasks. We attribute the underlying mechanism to Asymmetric Learning in training attention networks. When the sign of the previous step is inconsistent with the sign of the current step in a next-step-prediction time series, attention fails to learn the residual features. This makes it difficult to generalize to out-of-distribution (OOD) data with the same representation pattern, especially sign-inconsistent next-step-prediction data, whereas a linear residual network can easily do so. We hope our theoretical insights provide important necessary conditions for designing expressive and efficient transformer-based architectures for practitioners.
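To make the "sign-inconsistent next-step-prediction" setting more concrete, here is a minimal toy sketch. It is not the paper's construction: the data generator, the coefficient `rho`, the noise level, and all function names are assumptions made for illustration. It shows only that a one-parameter linear residual predictor of the form y_hat = x + w * x can fit a relation in which the next step has the opposite sign of the previous step, and that the fitted weight carries over to inputs of much larger magnitude. It does not reproduce the attention analysis.

```python
import random


def make_pairs(n, lo, hi, rho=0.8, noise=0.01, seed=0):
    """Hypothetical toy data: (previous step x, next step y) with opposite signs.

    |x| is drawn uniformly from [lo, hi] with a random sign, and
    y = -rho * x + small Gaussian noise, so sign(y) != sign(x).
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.uniform(lo, hi) * rng.choice([-1.0, 1.0])
        y = -rho * x + rng.gauss(0.0, noise)
        pairs.append((x, y))
    return pairs


def fit_residual_weight(pairs):
    """Closed-form least squares for w in the residual predictor y_hat = x + w * x."""
    num = sum(x * (y - x) for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den


def mean_abs_error(pairs, w):
    """Mean absolute error of the residual predictor on the given pairs."""
    return sum(abs((x + w * x) - y) for x, y in pairs) / len(pairs)


# Train on small-magnitude inputs, evaluate on larger-magnitude (OOD) inputs.
train = make_pairs(200, 0.1, 1.0, seed=0)
ood = make_pairs(200, 5.0, 10.0, seed=1)

w = fit_residual_weight(train)
print("fitted residual weight:", w)
print("train MAE:", mean_abs_error(train, w))
print("OOD MAE:", mean_abs_error(ood, w))
```

In this toy setup the ideal residual weight is -(1 + rho), because the residual branch has to cancel the skip connection and then flip the sign. The out-of-distribution split differs from the training split only in input magnitude, which is one simple reading of "same representation pattern" and may be narrower than the paper's definition.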
