WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting
Jiecheng Lu, Xu Han, Yan Sun, Shihao Yang
TL;DR
The paper tackles time series forecasting with decoder-only autoregressive Transformers and introduces WAVE, a weighted autoregressive varying gate that injects a moving-average (MA) term into autoregressive (AR) attention via an indirect MA weight generation scheme. This design preserves the $O(N)$ time complexity and parameter count of the underlying efficient attention while decoupling short-term local effects from long-term patterns, which improves the modeling of seasonal and cyclic temporal dependencies. Across 12 real-world TSF datasets, WAVE consistently outperforms AR baselines and achieves competitive or state-of-the-art results, with notable gains from linear attention variants and minimal additional computational cost. The work offers a practical, scalable approach to enhancing autoregressive Transformers for time series, with potential extensions to multivariate forecasting and broader sequence modeling tasks.
Abstract
We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can be applied to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention, which incorporates the ARMA structure, consistently improves the performance of various AR attention mechanisms on TSF tasks, achieving state-of-the-art results.
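To make the ARMA idea above concrete, the following is a minimal sketch, not the authors' implementation. It assumes an unnormalized causal linear attention as the AR part and adds an MA term that attends over past one-step AR errors, so both parts can be computed with running sums in $O(N)$ time. The function name `arma_linear_attention` and the inputs `q_ma` and `k_ma` are hypothetical and chosen only for illustration. The feature maps, normalization, gating, and the specific indirect MA weight generation used in WAVE are not described in this section and are omitted here.

```python
import numpy as np


def arma_linear_attention(q, k, v, q_ma, k_ma):
    """Illustrative ARMA-style causal linear attention (assumed form).

    AR part:  ar_t = sum_{i <= t} (q_t . k_i) v_i
    MA part:  ma_t = sum_{j <  t} (q_ma_t . k_ma_j) r_j,
              where r_j = v_{j+1} - ar_j is the one-step AR error.
    All inputs have shape (n, d); the output has shape (n, d_v).
    """
    n, d = q.shape
    d_v = v.shape[1]
    s_ar = np.zeros((d, d_v))              # running sum of outer(k_i, v_i)
    s_ma = np.zeros((q_ma.shape[1], d_v))  # running sum of outer(k_ma_j, r_j)
    out = np.zeros((n, d_v))
    for t in range(n):
        s_ar += np.outer(k[t], v[t])
        ar = q[t] @ s_ar
        ma = q_ma[t] @ s_ma                # only residuals from steps < t
        out[t] = ar + ma
        if t + 1 < n:
            # The error of step t becomes available once v[t + 1] is observed,
            # so it is added to the state only after out[t] has been produced.
            residual = v[t + 1] - ar
            s_ma += np.outer(k_ma[t], residual)
    return out
```

Under these assumptions the MA weights are never materialized as an explicit matrix: they are implied by the products of `q_ma` and `k_ma`, so the extra work is a second running state of the same size as the AR one. Setting `q_ma` to zeros recovers plain causal linear attention.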
