Fourier-Mixed Window Attention: Accelerating Informer for Long Sequence Time-Series Forecasting
Nhat Thanh Tran, Jack Xin
TL;DR
This work addresses the computational bottleneck of long-sequence time-series forecasting by introducing Fourier-Mixed window attention (FWin), a local-global attention mechanism that replaces Informer's ProbSparse blocks with windowed self-attention followed by a Fourier mixing layer. The authors provide a formal construction of FWin, prove that full attention can be exactly represented by a mixed-window formulation under the block-diagonal invertibility (BDI) condition, and demonstrate empirically that FWin achieves comparable or better accuracy than Informer while substantially reducing inference time (1.6–2×) and model size. A lighter variant, FWin-S, further speeds up inference with competitive performance. The approach is validated on diverse univariate and multivariate datasets, including a power-grid scenario, and is shown to be robust without relying on prior knowledge about seasonality or sparsity patterns. The combination of theoretical guarantees and practical speedups highlights FWin as a versatile, data-agnostic acceleration strategy for long-sequence forecasting.
Abstract
We study a fast local-global window-based attention method to accelerate Informer for long sequence time-series forecasting. Window attention is local and therefore yields considerable computational savings, but it cannot capture global token information; a subsequent Fourier transform block compensates for this. Our method, named FWin, relies neither on the query sparsity hypothesis nor on the empirical approximation underlying the ProbSparse attention of Informer. Through experiments on univariate and multivariate datasets, we show that FWin transformers improve the overall prediction accuracies of Informer while accelerating its inference speeds by 1.6 to 2 times. We also provide a mathematical definition of FWin attention and prove that it is equivalent to the canonical full attention under the block diagonal invertibility (BDI) condition of the attention matrix. The BDI condition is shown experimentally to hold with high probability for typical benchmark datasets.
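
The local-then-global structure described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single head, non-overlapping windows whose length divides the sequence length, and an FNet-style mixing step that keeps the real part of a 2D discrete Fourier transform. The names `window_attention`, `fourier_mix`, and `fwin_block` and the projection matrices `wq`, `wk`, `wv` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_attention(q, k, v, window):
    """Self-attention restricted to non-overlapping windows.

    q, k, v have shape (L, d); `window` must divide L (an assumption of
    this sketch). Each token attends only to tokens in its own window.
    """
    L, d = q.shape
    assert L % window == 0, "sequence length must be a multiple of window"
    n = L // window
    qw = q.reshape(n, window, d)
    kw = k.reshape(n, window, d)
    vw = v.reshape(n, window, d)
    scores = qw @ kw.transpose(0, 2, 1) / np.sqrt(d)  # (n, window, window)
    out = softmax(scores, axis=-1) @ vw               # (n, window, d)
    return out.reshape(L, d)


def fourier_mix(x):
    # Assumed FNet-style mixing: real part of the 2D DFT over the
    # sequence and feature axes. Every output entry depends on every
    # input token, which restores global interaction.
    return np.real(np.fft.fft2(x))


def fwin_block(x, wq, wk, wv, window):
    # Local step (window attention) followed by the global step (Fourier mix).
    q, k, v = x @ wq, x @ wk, x @ wv
    local = window_attention(q, k, v, window)
    return fourier_mix(local)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, d, w = 96, 16, 24
    x = rng.standard_normal((L, d))
    wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
    print(fwin_block(x, wq, wk, wv, w).shape)
```

In this sketch the attention scores cost on the order of L·w·d operations for window length w, compared with L²·d for full attention, and setting `window` equal to the sequence length recovers ordinary full attention.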
