Fourier-Mixed Window Attention: Accelerating Informer for Long Sequence Time-Series Forecasting

Nhat Thanh Tran; Jack Xin

TL;DR

This work addresses the computational bottleneck of long-sequence time-series forecasting by introducing Fourier-Mixed window attention (FWin), a local-global attention mechanism that replaces Informer's ProbSparse blocks with windowed self-attention followed by a Fourier mixing layer. The authors provide a formal construction of FWin, prove that full attention can be exactly represented by a mixed-window formulation under the block-diagonal invertibility (BDI) condition, and demonstrate empirically that FWin achieves comparable or better accuracy than Informer while substantially reducing inference time (1.6–2×) and model size. A lighter variant, FWin-S, further speeds up inference with competitive performance. The approach is validated on diverse univariate and multivariate datasets, including a power-grid scenario, and is shown to be robust without relying on prior knowledge about seasonality or sparsity patterns. The combination of theoretical guarantees and practical speedups highlights FWin as a versatile, data-agnostic acceleration strategy for long-sequence forecasting.

Abstract

We study a fast local-global window-based attention method to accelerate Informer for long sequence time-series forecasting. Window attention, being local, yields considerable computational savings but cannot capture global token information; a subsequent Fourier transform block compensates for this. Our method, named FWin, relies on neither the query sparsity hypothesis nor the empirical approximation underlying the ProbSparse attention of Informer. Through experiments on univariate and multivariate datasets, we show that FWin transformers improve the overall prediction accuracy of Informer while accelerating its inference speed by 1.6 to 2 times. We also provide a mathematical definition of FWin attention and prove that it is equivalent to canonical full attention under the block diagonal invertibility (BDI) condition of the attention matrix. Experiments show that the BDI condition holds with high probability on typical benchmark datasets.
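The local-then-global idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names are hypothetical, the windows are assumed to be non-overlapping, and the Fourier block is assumed to be an FNet-style mix (real part of a 2D FFT over the sequence and hidden axes).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(Q, K, V, w):
    """Scaled dot-product attention restricted to non-overlapping windows of length w."""
    L, d = Q.shape
    assert L % w == 0, "w must divide the sequence length L"
    Qb = Q.reshape(L // w, w, d)
    Kb = K.reshape(L // w, w, d)
    Vb = V.reshape(L // w, w, V.shape[1])
    # One (w x w) score matrix per window; tokens never attend across windows.
    scores = Qb @ Kb.transpose(0, 2, 1) / np.sqrt(d)
    return (softmax(scores) @ Vb).reshape(L, V.shape[1])

def fourier_mix(X):
    """Assumed FNet-style mixing: real part of the 2D FFT over (sequence, hidden)."""
    return np.fft.fft2(X).real

def fwin_like_block(Q, K, V, w):
    # Local step (window attention) followed by a global step (Fourier mixing).
    return fourier_mix(window_attention(Q, K, V, w))
```

With w equal to L the local step reduces to ordinary full attention. For smaller w, the window step alone cannot pass information between windows; the Fourier step is what makes the block output depend on all tokens.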

Paper Structure (18 sections, 6 theorems, 34 equations, 8 figures, 14 tables)

Introduction
Background and Preliminary
Methodology
Experiment
Results and Analysis
Theoretical Results
Conclusion
Related Work
Ablation Study
Effect of window size parameter
Theoretical Results
Additional Experimental Data and Details
Setup of experiments
Setup of train/inference time
FWin Accuracy with Standard Deviation
...and 3 more sections

Key Result

Theorem 5.6

Let $Q, K, V \in\mathbb{R}^{L\times d}$. Let $w\in \mathbb{N}$ such that $w$ divides $L$. If Attn($Q,K$) is BDI, then there exists a matrix $A\in \mathbb{R}^{L\times L}$ such that In particular, we can construct the exact value of $A$.
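A numerical sketch may help make the statement concrete. It rests on two assumptions that are not confirmed by the text above: that BDI means every w-by-w diagonal block of the attention matrix is invertible, and that the claim is that the full-attention output equals A times the window-attention output. Under those assumptions, one possible A (not necessarily the authors' construction) is the solution of A W = P, where P is the full attention matrix and W is the block-diagonal window attention matrix. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention_matrix(Q, K):
    """L x L softmax attention matrix P."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[1]))

def window_attention_matrix(Q, K, w):
    """Block-diagonal matrix W whose i-th (w x w) block is the attention inside window i."""
    L, d = Q.shape
    assert L % w == 0, "w must divide L"
    W = np.zeros((L, L))
    for s in range(0, L, w):
        W[s:s + w, s:s + w] = softmax(Q[s:s + w] @ K[s:s + w].T / np.sqrt(d))
    return W

def diagonal_block_condition_numbers(P, w):
    """Condition number of each (w x w) diagonal block; a finite value means invertible."""
    return [np.linalg.cond(P[s:s + w, s:s + w]) for s in range(0, P.shape[0], w)]

def mixing_matrix(Q, K, w):
    """Solve A W = P for A (assumed form of the theorem's matrix A)."""
    P = full_attention_matrix(Q, K)
    W = window_attention_matrix(Q, K, w)
    # A W = P is equivalent to W^T A^T = P^T, which np.linalg.solve handles directly.
    return np.linalg.solve(W.T, P.T).T
```

Each diagonal block of P differs from the matching block of W only by a positive row rescaling, since the two softmax normalizers differ row by row. The diagonal blocks of P are therefore invertible exactly when the blocks of W are, which is why W can be inverted in this sketch whenever the assumed BDI condition holds.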

Figures (8)

Figure 1: Model comparison: Informer (left), FWin (right, orange color denotes our contributions); FWin-S (FWin with its decoder's Fourier Mix block removed).
Figure 2: Window size versus error for ETTh1 multivariate data at the long-range prediction length of 720.
Figure 3: Univariate post-fault prediction comparison (voltage vs. time in seconds) on power grid data [powergrid_toolbox, glassoformer_22]: {FWin, Informer} outperform {FED, Auto, ETS}formers and PatchTST. The dashed line to the left of the 2-second mark is the input, to the right of which are the predictions vs. the ground truth (in black).
Figure 4: Univariate post-fault prediction (voltage vs. time in seconds) on power grid data [powergrid_toolbox, glassoformer_22]. FWin and FWin-S have "smooth" predictions, while Informer has spurious jumps. "Full" in the bottom frame refers to Informer using full attention instead of ProbSparse. The dashed line to the left of the 2-second mark is the input, to the right of which are the model predictions vs. the ground truth (in black).
Figure 5: Condition numbers for the ETTh1 (M) dataset under various window sizes. The label "n/m (k %)" in the top right corner of each subplot gives the total number m of condition numbers, the number n of those that are infinite, and the percentage $k$ that are infinite.
...and 3 more figures

Theorems & Definitions (23)

Definition 5.1
Definition 5.2
Definition 5.3
Definition 5.4
Definition 5.5
Theorem 5.6
Definition 5.7
Corollary 5.8
Definition 5.9
Corollary 5.10
...and 13 more
