WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting

Jiecheng Lu, Xu Han, Yan Sun, Shihao Yang

TL;DR

The paper tackles time series forecasting (TSF) with decoder-only autoregressive (AR) Transformers and introduces WAVE, a weighted autoregressive varying gate that adds a moving-average (MA) term to AR attention through an indirect MA weight generation scheme. The design keeps the time complexity of the underlying attention mechanism (for example, $O(N)$ for linear attention) and its original parameter count, while decoupling short-term local effects from long-term patterns to better model seasonal and cyclic temporal dependencies. Across 12 real-world TSF datasets, WAVE consistently outperforms its AR baselines and achieves competitive or state-of-the-art results, with notable gains from linear attention variants and minimal additional computational cost. The work offers a practical, scalable way to enhance autoregressive Transformers for time series, with potential extensions to multivariate forecasting and broader sequence modeling tasks.

Abstract

We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention that incorporates the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.
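For readers unfamiliar with the ARMA model that the abstract refers to, the following is a minimal sketch of the classical statistical recursion, not of WAVE attention itself. It assumes a zero-mean series with fixed coefficients; the function name `arma_one_step` and the parameters `phi` (AR weights) and `theta` (MA weights) are illustrative and do not come from the paper. The AR term is a weighted sum of past observations, while the MA term is a weighted sum of past one-step prediction errors, which is the kind of short-term local effect the abstract describes.

```python
def arma_one_step(x, phi, theta):
    """One-step-ahead predictions of a zero-mean ARMA(p, q) model.

    x     : list of observations
    phi   : AR coefficients, phi[i] weights x[t-1-i]
    theta : MA coefficients, theta[j] weights the residual at t-1-j
    Returns (predictions, residuals), both lists of len(x).
    """
    p, q = len(phi), len(theta)
    pred, resid = [], []
    for t in range(len(x)):
        # AR part: weighted sum of the available past observations.
        ar = sum(phi[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        # MA part: weighted sum of the available past prediction errors.
        ma = sum(theta[j] * resid[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        pred.append(ar + ma)
        resid.append(x[t] - pred[t])
    return pred, resid


if __name__ == "__main__":
    predictions, residuals = arma_one_step([1.0, 2.0, 3.0], phi=[0.5], theta=[0.5])
    print(predictions, residuals)
```

In this classical form the MA weights are explicit parameters. According to the abstract, WAVE instead generates its MA weights indirectly so that the parameter count of the underlying attention is unchanged; that mechanism is not reproduced here.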

Paper Structure (23 sections, 6 equations, 12 figures, 12 tables)

Figures (12)

  • Figure 1: (Left: a) Overall architecture of our decoder Transformer for TSF. (Right: b) Box plots of performance rankings from 48 sub-experiments across 12 datasets. Green represents WAVE Transformers, yellow AR Transformers, and red the baselines, with triangles indicating mean rankings. AR Transformers perform comparably to baselines, while WAVE Transformers significantly outperform their AR counterparts. See the corresponding result tables for more details.
  • Figure 2: Visualization of different effects with exponential decay strategies and their challenges in gated linear attention. (Left: a) Pure exponential decay strategy in gated linear attention; (Mid: b) Exponential decay facing challenges in capturing long-term dependencies; (Right: c) Exponential decay facing challenges in capturing periodic dependencies.
  • Figure 3: WAVE attention structure with the indirect MA weight generation method applied to softmax and linear attention. See Table \ref{tab:ARMA_summary} for more calculation details.
  • Figure 4: Visualization of the $\mathbf{B} (\textrm{left}) -\mathbf{\Theta} (\textrm{right})$ relationship with different $\phi(\cdot)$. We construct the simulated $\mathbf{B}$ matrices using randomly sampled $\bm{q}$ and $\bm{k}$ ($N=64$, $d=32$) from the normal distribution, and display the corresponding implicit $\mathbf{\Theta}$ matrices.
  • Figure 5: Visualization of test loss curves. We show the testing performance of five attention mechanisms using pure AR/WAVE structures on the Weather and ETTm1 datasets ($L_I = 512$, $L_P = 48$).
  • ...and 7 more figures
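
The exponential decay behaviour named in the Figure 2 caption can be illustrated with a small sketch. It assumes the simplest form of gated linear attention, with one constant scalar gate `gamma` and no feature map or normalization; the function names and the constant gate are illustrative assumptions, not the paper's formulation. The recurrent form runs in time linear in the sequence length, and it is equivalent to a lower-triangular weight matrix whose entries shrink as `gamma ** (t - i)` for older tokens.

```python
def decayed_linear_attention(qs, ks, vs, gamma):
    """Causal linear attention with a constant scalar decay gate (assumed form).

    Recurrence: S_t = gamma * S_{t-1} + k_t v_t^T,  o_t = q_t^T S_t.
    qs, ks : lists of length-d lists; vs : list of length-m lists.
    """
    d, m = len(ks[0]), len(vs[0])
    state = [[0.0] * m for _ in range(d)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # Decay the running state, then add the new key-value outer product.
        for a in range(d):
            for b in range(m):
                state[a][b] = gamma * state[a][b] + k[a] * v[b]
        # Read out with the current query.
        outputs.append([sum(q[a] * state[a][b] for a in range(d)) for b in range(m)])
    return outputs


def explicit_weights(qs, ks, gamma):
    """Equivalent causal weights: w[t][i] = gamma**(t - i) * <q_t, k_i> for i <= t."""
    n = len(qs)
    return [
        [
            gamma ** (t - i) * sum(a * b for a, b in zip(qs[t], ks[i])) if i <= t else 0.0
            for i in range(n)
        ]
        for t in range(n)
    ]


if __name__ == "__main__":
    ones = [[1.0], [1.0], [1.0]]
    values = [[1.0], [0.0], [0.0]]
    print(decayed_linear_attention(ones, ones, values, gamma=0.5))
    print(explicit_weights(ones, ones, gamma=0.5)[2])
```

With `gamma = 0.5`, the contribution of the first token halves at every step, which gives a rough sense of why a pure decay gate can struggle with long-term and periodic dependencies, as panels (b) and (c) of Figure 2 describe.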