Enhancing Transformer-based models for Long Sequence Time Series Forecasting via Structured Matrix

Zhicheng Zhang, Yong Wang, Shaoqi Tan, Bowei Xia, Yujie Luo

TL;DR

This work targets the scalability of Transformer-based long-sequence time-series forecasting by introducing the Surrogate Attention Block (SAB) and the Surrogate FFN Block (SFB), both built on Monarch structured matrices, to replace the self-attention and feed-forward layers. The authors prove that the substitution preserves expressiveness and show that the SAB is a linear time-invariant system, which supports stable optimization during training. The resulting framework achieves sub-quadratic complexity (roughly $O(N^{3/2})$) with substantial reductions in parameters and FLOPs, while delivering average performance gains of around 12.4% across five tasks and 2,769 tests. Because the substitution is hardware-friendly and applies broadly to Transformer-based architectures, it offers scalable improvements for long-horizon forecasting, imputation, and related time-series tasks. The work also includes ablation studies and analyses of convergence, layer-wise effects, and comparisons with other optimization approaches, highlighting both the strengths and the task-dependent limitations of the method.

Abstract

Recently, Transformer-based models for long sequence time series forecasting have demonstrated promising results. The self-attention mechanism, the core component of these models, exhibits great potential in capturing various dependencies among data points. Despite these advancements, improving the efficiency of the self-attention mechanism remains a concern. Unfortunately, current specific optimization methods face challenges in applicability and scalability for the future design of long sequence time series forecasting models. Hence, in this article, we propose a novel architectural framework that enhances Transformer-based models through the integration of Surrogate Attention Blocks (SAB) and Surrogate Feed-Forward Neural Network Blocks (SFB). The framework reduces both time and space complexity by replacing the self-attention and feed-forward layers with SAB and SFB while maintaining their expressive power and architectural advantages. The equivalence of this substitution is fully demonstrated. Extensive experiments on 10 Transformer-based models across five distinct time series tasks demonstrate an average performance improvement of 12.4%, alongside a 61.3% reduction in parameter counts.

Paper Structure (46 sections, 3 theorems, 48 equations, 19 figures, 25 tables)


Key Result

Proposition 1

Let $\mathbf{W}$ be a weight matrix. A linear projection $\mathrm{LinearProj}(\mathbf{X}) = \mathbf{X}\mathbf{W}$ is equivalent to a structured linear projection $\mathrm{StructuredLinearProj}(\mathbf{X}) = \mathbf{X}\mathbf{M}$, where $\mathbf{M}$ is a structured matrix.
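To make the cost side of a structured projection concrete, the following is a minimal NumPy sketch of one common Monarch-style form, assumed here for illustration: two block-diagonal factors interleaved with a fixed reshape-and-transpose permutation. The paper's exact parameterization is not given in this summary, so the function names (`monarch_matmul`, `monarch_dense`), the factor ordering, and the sizes are hypothetical. The sketch shows only that $\mathbf{X}\mathbf{M}$ can be computed without materializing $\mathbf{M}$, and that the factors hold $2N^{3/2}$ parameters rather than $N^2$; it does not reproduce the equivalence argument of Proposition 1.

```python
import numpy as np


def monarch_matmul(x, first_blocks, second_blocks):
    """Compute x @ M for a Monarch-style M without building M.

    x:             (batch, n) with n = m * m
    first_blocks:  (m, m, m), i.e. m diagonal blocks of size m x m
    second_blocks: (m, m, m), same layout
    Assumed form (illustrative): M = B1 @ P @ B2 @ P, where B1 and B2 are
    block-diagonal and P is the permutation that transposes an m x m grid.
    """
    batch, n = x.shape
    m = first_blocks.shape[0]
    assert n == m * m, "feature size must be a perfect square in this sketch"

    h = x.reshape(batch, m, m)                      # split features into m groups of m
    h = np.einsum("bki,kij->bkj", h, first_blocks)  # block k acts on group k
    h = h.transpose(0, 2, 1)                        # permutation P
    h = np.einsum("bki,kij->bkj", h, second_blocks)
    h = h.transpose(0, 2, 1)                        # permutation P again (P is its own inverse)
    return h.reshape(batch, n)


def monarch_dense(first_blocks, second_blocks):
    """Materialize the equivalent dense n x n matrix (for checking only)."""
    m = first_blocks.shape[0]
    # The map is linear, so applying it to the identity yields M itself.
    return monarch_matmul(np.eye(m * m), first_blocks, second_blocks)


rng = np.random.default_rng(0)
m = 4                                   # block size; n = m * m = 16
n = m * m
first_blocks = rng.standard_normal((m, m, m))
second_blocks = rng.standard_normal((m, m, m))
x = rng.standard_normal((3, n))

y_fast = monarch_matmul(x, first_blocks, second_blocks)
y_dense = x @ monarch_dense(first_blocks, second_blocks)

structured_params = first_blocks.size + second_blocks.size  # 2 * m**3 = 2 * n**1.5
dense_params = n * n
print(np.allclose(y_fast, y_dense), structured_params, dense_params)
```

Each block-diagonal multiply costs on the order of $m \cdot m^2 = N^{3/2}$ multiply-adds per input row, which matches the sub-quadratic scaling quoted in the TL;DR. The gap to the dense $N^2$ cost widens as $N$ grows.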

Figures (19)

  • Figure 1: Abstract architecture of a Transformer-based model. $\mathcal{N}$ denotes the number of Encoder Layers, $\mathcal{N}$ denotes the number of Decoder Layers.
  • Figure 2: Overview of the proposed enhancement process for a Transformer-based model. For ease of presentation, here only one X-MHSA layer and one FNN layer are illustrated in the Transformer-based model, neglecting the encoder-decoder architecture, multi-layer stacking and other layers that were not modified.
  • Figure 3: Box plots of statistical results on the distribution of long-term forecasts for each metric. The horizontal axis denotes the models and the vertical axis is the percentage of lift after using the structured matrix. The width of the color blocks in the box indicates the density of the distribution. The green dashed line is the mean and the orange dashed line is the median.
  • Figure 4: Box plots of statistical results on the distribution of short-term forecasts for each metric. The horizontal axis denotes the models and the vertical axis is the percentage of lift after using the structured matrix. The width of the color blocks in the box indicates the density of the distribution. The green dashed line is the mean and the orange dashed line is the median.
  • Figure 5: Box plots of statistical results for the distribution of imputation for each metric. The horizontal axis is the model and the vertical axis is the percentage of lift after using the structured matrix. The width of the color blocks in the box indicates the density of the distribution. The green dashed line is the mean and the orange dashed line is the median.
  • ...and 14 more figures

Theorems & Definitions (5)

  • Remark 1
  • Proposition 1
  • Proof 1
  • Theorem 1
  • Theorem 2