Linear Transformers as VAR Models: Aligning Autoregressive Attention Mechanisms with Autoregressive Forecasting

Jiecheng Lu, Shihao Yang

TL;DR

The paper addresses the problem that deep Transformers used for autoregressive time series forecasting (TSF) are often misaligned with the autoregressive objective, which hinders VAR-like interpretability. It reframes linear attention as a dynamic VAR and restructures the architecture into SAMoVAR, using temporal influence paths and ARX tokenization to align multi-layer linear attention with VAR forecasting. Key contributions include a VAR interpretation of single-layer linear attention, an analysis of the sources of misalignment together with a proposed alignment strategy, and the SAMoVAR variant, which delivers improved accuracy, interpretability, and efficiency across synthetic and real TSF benchmarks. The work advances interpretable, efficient TSF and suggests that VAR-aligned Transformers may apply more broadly to sequence modeling tasks.

Abstract

Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention sometimes outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data-generating processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. Then, we propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF. By aligning the Transformer architecture with autoregressive objectives, SAMoVAR delivers improved performance, interpretability, and computational efficiency compared with SOTA TSF models.
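
The claim that a single linear attention layer behaves like a dynamic VAR can be illustrated with a small numerical sketch. The code below is not the paper's formulation: it assumes an identity feature map, no normalization, and linear query/key/value projections, and the names `causal_linear_attention`, `dynamic_var_weights`, and `var_output` are hypothetical. Under those assumptions, the causal output at step $t$ equals $\sum_{i \le t} A_{t,i} x_i$ with input-dependent coefficient matrices $A_{t,i} = (q_t \cdot k_i) W_v$, which has the form of a VAR with time-varying weights.

```python
import numpy as np


def causal_linear_attention(x, w_q, w_k, w_v):
    """Causal linear attention, assuming an identity feature map and no normalization."""
    q, k, v = x @ w_q.T, x @ w_k.T, x @ w_v.T
    state = np.zeros((w_v.shape[0], w_k.shape[0]))  # running sum of v_i k_i^T
    out = np.zeros_like(v)
    for t in range(len(x)):
        state += np.outer(v[t], k[t])  # include step t itself (i <= t)
        out[t] = state @ q[t]          # sum_i v_i (k_i . q_t)
    return out


def dynamic_var_weights(x, w_q, w_k, w_v):
    """Coefficient matrices A[t, i] = (q_t . k_i) * W_v for i <= t, zero otherwise."""
    q, k = x @ w_q.T, x @ w_k.T
    scores = np.tril(q @ k.T)               # (T, T), causal mask
    return scores[:, :, None, None] * w_v   # (T, T, d_out, d_in)


def var_output(x, weights):
    """Apply the time-varying VAR: out[t] = sum_i A[t, i] @ x[i]."""
    return np.einsum("tioj,ij->to", weights, x)
```

Comparing `causal_linear_attention(x, w_q, w_k, w_v)` with `var_output(x, dynamic_var_weights(x, w_q, w_k, w_v))` on random inputs should give matching results up to floating-point error, which is the single-layer equivalence in this simplified setting. The explicit `(T, T, d, d)` weight tensor is only for inspection; it is far less efficient than the recurrent form.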

Paper Structure

This paper contains 19 sections, 11 equations, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Visualization of Key Concepts in SAMoVAR. The subfigures highlight different structural and conceptual elements of the model.
  • Figure 2: Illustration of the ARX tokenization, where we use $s_{t}^j$ to represent the $t$-th patch token of series $j$, $\mathbf{S}_I^{[i:i+L_P,j]}$.
  • Figure 3: Visualization of the validation datapoint and model weights for the synthetic VAR task. See Section \ref{sec:syn} for more details.
  • Figure 4: Visualization of the loss curves for synthetic VAR tasks.
  • Figure 5: Visualization of the 2 temporal influence paths from step 124 to step 128 for the two series in the datapoint shown in Fig. \ref{fig:synthetic}, where even-numbered steps represent endogenous tokens and odd-numbered steps represent exogenous tokens.
  • ...and 8 more figures