Curse of Attention: A Kernel-Based Perspective for Why Transformers Fail to Generalize on Time Series Forecasting and Beyond

Yekun Ke, Yingyu Liang, Zhenmei Shi, Zhao Song, Chiwun Yang

TL;DR

The paper addresses why Transformer models often fail to generalize in time series forecasting by presenting the first theoretical explanation based on asymmetric learning in attention training. Using a two-layer attention model analyzed through the Neural Tangent Kernel (NTK) framework, it shows that parameter updates tend to align along the direction of the value weights, hindering the learning of residual (core) features when sign relations between successive steps are inconsistent. It demonstrates that linear residual networks can generalize to out-of-distribution (OOD) data under sign-inconsistent next-step-prediction tasks, while attention cannot, and it provides generalizations and potential remedies such as Differential Transformer, RoPE, and patching. Collectively, these results offer a principled foundation for designing more expressive and efficient transformer-based architectures for time series forecasting and related sequential tasks.

Abstract

The application of transformer-based models to time series forecasting (TSF) tasks has long been a popular research topic. However, many of these works fail to beat the simple linear residual model, and the theoretical understanding of this issue is still limited. In this work, we propose the first theoretical explanation of the inefficiency of transformers on TSF tasks. We attribute the mechanism behind it to Asymmetric Learning in training attention networks. When the sign of the previous step is inconsistent with the sign of the current step in next-step-prediction time series, attention fails to learn the residual features. This makes it difficult to generalize to out-of-distribution (OOD) data, especially sign-inconsistent next-step-prediction data with the same representation pattern, whereas a linear residual network can easily do so. We hope our theoretical insights provide important necessary conditions for designing expressive and efficient transformer-based architectures for practitioners.

Paper Structure

This paper contains 61 sections, 26 theorems, 187 equations, 1 figure.

Key Result

Lemma 5.1

Let $\delta \in (0,0.1)$, $B = \max\{1, \sqrt{(1+\sigma^2)\log(nN/\delta)}\}$, and $D = \max\{\sqrt{\log(m/\delta)}, 1\}$. Suppose that for any $r \in [m]$ we have $|w_r(t)-w_r(0)| \leq R$, where $R \leq \frac{\lambda}{n \, \mathrm{poly}(\exp(B^2),\exp(D))}$. Then with probability at least $1-\delta$ ...
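
To get a feel for the scale of the constants in the lemma, the following minimal sketch evaluates $B$ and $D$ exactly as they are written in the statement above. The function name `lemma_constants` is hypothetical, and the roles of $n$, $N$, $m$, and $\sigma$ are not spelled out in this excerpt, so the code treats them as plain positive numbers.

```python
import math


def lemma_constants(n, N, m, sigma, delta):
    """Evaluate B and D from the statement of Lemma 5.1.

    B = max{1, sqrt((1 + sigma^2) * log(n * N / delta))}
    D = max{sqrt(log(m / delta)), 1}

    The argument names mirror the symbols in the lemma; their
    interpretation is an assumption left to the caller.
    """
    if not 0.0 < delta < 0.1:
        raise ValueError("the lemma assumes delta in (0, 0.1)")
    if n <= 0 or N <= 0 or m < 1:
        raise ValueError("n and N must be positive, and m must be at least 1")
    B = max(1.0, math.sqrt((1.0 + sigma ** 2) * math.log(n * N / delta)))
    D = max(math.sqrt(math.log(m / delta)), 1.0)
    return B, D


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    B, D = lemma_constants(n=100, N=16, m=1024, sigma=0.5, delta=0.05)
    print(f"B = {B:.3f}, D = {D:.3f}")
```

Because the bound on $R$ has $\mathrm{poly}(\exp(B^2), \exp(D))$ in the denominator, even modest values of $B$ make the admissible radius very small.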

Figures (1)

  • Figure 1: (a) We compare previous models [ydjy17, wxwl21, zmww+22, lyll+22, zczx23] on the benchmark datasets ETTh1 and ETTh2. The experimental results show that, even though the simple linear models NLinear and DLinear have far fewer parameters than Transformer-based models, they exhibit superior generalization ability on TSF tasks. (b) Theoretically expected gradient direction when training a transformer-based model on TSF tasks. In our setup, we focus on the features at the last time step (also referred to as core features), denoted as $x_{k+1}$, and the features at previous time steps (also referred to as background features), denoted as $x_{k}$ ($k \in [d]$). Our theoretical findings suggest that the asymmetric feature updates in attention make it difficult for the attention mechanism to learn the recent residual features when the directions of $x_{k+1}$ and $x_{k}$ are not aligned. In detail, when the training data satisfies $x_k \cdot x_{k+1} < 0$, the gradient is contaminated by background features due to the learning disadvantage of attention.
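
The misalignment condition $x_k \cdot x_{k+1} < 0$ from the caption can be checked directly on data. The sketch below is illustrative and not taken from the paper: the function name `sign_inconsistent_steps` is hypothetical, and applying the test to every consecutive pair of a sequence is an assumption made for the example.

```python
import numpy as np


def sign_inconsistent_steps(x):
    """Flag consecutive steps whose features point in opposing directions.

    x: array of shape (T,) or (T, d), one feature (vector) per time step.
    Returns a boolean array of length T - 1 whose entry k is True
    when the inner product of step k and step k + 1 is negative.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        # Treat a scalar series as T one-dimensional feature vectors.
        x = x[:, None]
    # Row-wise inner product between each step and its successor.
    dots = np.sum(x[:-1] * x[1:], axis=1)
    return dots < 0


if __name__ == "__main__":
    series = np.array([1.0, 2.0, -1.0, -3.0, 4.0])
    mask = sign_inconsistent_steps(series)
    print("sign-inconsistent transitions:", mask)
    print("fraction:", mask.mean())
```

A high fraction of flagged transitions would indicate that a dataset falls into the sign-inconsistent regime the caption describes.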

Theorems & Definitions (74)

  • Definition 4.1: State space model (SSM), informal version of Lemma \ref{def:ssm}
  • Claim 4.2: Residual SSM for generating data, informal version of Claim \ref{clm:res_ssm}
  • Definition 4.3: Data Generation, informal version of Definition \ref{def:id_generator}
  • Lemma 5.1: Kernel Convergence, informal version of Lemma \ref{lem:kernel_pd_formal}
  • Proof: Proof sketch of Lemma \ref{lem:kernel_pd_informal}
  • Theorem 5.2: Informal version of Theorem \ref{thm:convergence}
  • Proof: Proof sketch of Theorem \ref{thm:convergence:informal}
  • Theorem 5.3: Attention fails to learn residual feature, informal version of Theorem \ref{thm:residual_failure}
  • Definition 6.1
  • Definition 6.2
  • ...and 64 more