Does Long-Term Series Forecasting Need Complex Attention and Extra Long Inputs?

Daojun Liang; Haixia Zhang; Dongfeng Yuan; Xiaoyan Ma; Dongyang Li; Minggao Zhang

TL;DR

The paper asks whether complex attention and very long inputs are essential for effective long-term series forecasting (LTSF). It introduces Periodformer, a lightweight Transformer variant built on Period-Attention, which aggregates subseries along the explicit period of the series and uses a gating mechanism to regulate the influence of attention, achieving linear-time complexity; it is combined with moving-average smoothing and cross-period components. To accelerate hyperparameter optimization, the paper proposes MABO, a multi-GPU asynchronous Bayesian optimization framework. Across six real-world datasets, Periodformer reduces MSE relative to strong baselines by roughly 13% for multivariate and 26% for univariate forecasting, while MABO cuts search time by about 46%. The work suggests that long input sequences and dense attention are not universally needed for strong LTSF performance, and it provides open-source code to replicate and extend the results.

Abstract

As Transformer-based models have achieved impressive performance on various time series tasks, Long-Term Series Forecasting (LTSF) tasks have also received extensive attention in recent years. However, due to the inherent computational complexity of Transformer-based methods and their demand for long sequences, their application to LTSF tasks still raises two major issues that need further investigation: 1) whether the sparse attention mechanisms designed by these methods actually reduce the running time on real devices; 2) whether these models need extra-long input sequences to guarantee their performance. The answers given in this paper are negative. Therefore, to better cope with these two issues, we design a lightweight Period-Attention mechanism (Periodformer), which renovates the aggregation of long-term subseries via explicit periodicity and of short-term subseries via built-in proximity. Meanwhile, a gating mechanism is embedded into Periodformer to regulate the influence of the attention module on the prediction results. Furthermore, to take full advantage of GPUs for fast hyperparameter optimization (e.g., finding a suitable input length), a Multi-GPU Asynchronous parallel algorithm based on Bayesian Optimization (MABO) is presented. MABO allocates a process to each GPU via a queue mechanism, and then creates multiple trials at a time for asynchronous parallel search, which greatly reduces the search time. Compared with state-of-the-art methods, Periodformer reduces the prediction error by 13% and 26% for multivariate and univariate forecasting, respectively. In addition, MABO reduces the average search time by 46% while finding better hyperparameters. In conclusion, this paper indicates that LTSF may not need complex attention and extra-long input sequences. The code has been open-sourced on GitHub.
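The queue-based allocation described in the abstract can be illustrated with a minimal sketch. It is not the authors' MABO implementation: it uses threads instead of one process per GPU, a toy objective instead of a training run, and random sampling as a stand-in for the Bayesian suggestion step. The names `suggest`, `objective`, `run_trial`, and `async_search` are hypothetical.

```python
import queue
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def suggest(history, rng):
    """Hypothetical stand-in for the Bayesian suggestion step (random sampling)."""
    return {"input_len": rng.choice([24, 48, 96, 192, 336])}


def objective(params, gpu_id):
    """Toy loss standing in for a training run on device `gpu_id` (assumption)."""
    return abs(params["input_len"] - 96) / 96.0


def run_trial(params, gpu_queue):
    gpu_id = gpu_queue.get()  # block until some GPU id is free
    try:
        return params, gpu_id, objective(params, gpu_id)
    finally:
        gpu_queue.put(gpu_id)  # hand the GPU id back to the queue


def async_search(num_gpus=2, num_trials=8, seed=0):
    rng = random.Random(seed)
    gpu_queue = queue.Queue()
    for gpu_id in range(num_gpus):
        gpu_queue.put(gpu_id)

    history = []
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        # Start several trials at once, one per available GPU.
        launched = min(num_gpus, num_trials)
        pending = {
            pool.submit(run_trial, suggest(history, rng), gpu_queue)
            for _ in range(launched)
        }
        while pending:
            # Asynchronous step: react as soon as any single trial finishes,
            # rather than waiting for the whole batch.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                history.append(future.result())
                if launched < num_trials:
                    params = suggest(history, rng)
                    pending.add(pool.submit(run_trial, params, gpu_queue))
                    launched += 1

    best = min(history, key=lambda trial: trial[2])
    return best, history
```

Calling `async_search(num_gpus=2, num_trials=8)` returns the best `(params, gpu_id, loss)` tuple together with the full trial history. The point of the design is that a finished trial immediately frees its GPU id and triggers a new suggestion, so no device idles while slower trials are still running.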

Paper Structure (26 sections, 8 equations, 18 figures, 6 tables, 2 algorithms)

Introduction
Existing Transformer-based Models
Runtime of existing Transformer-based models
Performance of existing Transformer-based models
Impact of the input length on model generalization
Periodformer
Definition
Architecture
Period-Attention
MABO
Asynchronous Parallel Strategy on Multi-GPUs
Hyperparameter Suggestion Strategy
Experiments
Experimental Results
Ablation studies
...and 11 more sections

Figures (18)

Figure 1: Performance (MSE), running time (Seconds/Epoch) and FLOPs (Bubble Size) comparisons of Transformer-based models on the LTSF task. All models use Transformer-like architectures with a 2-layer encoder and a 1-layer decoder. Their input lengths are all 96, and their prediction lengths are 96, 192, 336, and 720, respectively. Periodformer is the Period-Attention based model proposed in this paper. Transformer, Informer and Autoformer are from NIPS2017_Transformer, Zhou2021Informer and wu2021autoformer. The Full-Attention and Prob-Attention models both adopt the same architecture as Autoformer, but replace its attention part with Full-Attention NIPS2017_Transformer and Prob-Attention wu2021autoformer. All experiments were performed on the ETTm2 dataset using a Tesla V100 GPU, but similar results would be expected on other datasets or devices. The smaller the bubble and the closer it is to the bottom left corner, the better the overall performance of the model. Some models, such as FEDformer zhou2022fedformer, are omitted from this figure due to their long runtime (5$\times$ slower for FEDformer-f and 15$\times$ slower for FEDformer-w).
Figure 2: Information aggregation strategies adopted by various attention mechanisms. Full Attention NIPS2017_Transformer (a) aggregates information from all moments. Sparse Attention li2019enhancing, Kitaev2020Reformer (b) aggregates information through fixed intervals or random sampling. Auto-Correlation wu2021autoformer (c) aggregates information through the implicit periodicity obtained by Fourier transform. Period-Attention (ours) (d) aggregates information based on the explicit periodicity of series.
Figure 3: Ablation experiments of Transformer with different components. The vanilla Transformer performs poorly (worse than Autoformer) on LTSF tasks, but its performance (MSE, the lower the better) changes greatly once some improvements are made. Specifically, adding a moving average (Trans+MA) to reduce data noise and appropriately increasing Dropout (from 0.05 to 0.1) to change model sparsity both improve the average performance of Transformer. In particular, if attention is removed (Trans$-$Att), the effect on model performance varies by dataset. For example, Trans$-$Att performs better on ETTm2, but worse on Electricity.
Figure 4: The correlation between the input data and the predicted results weakens with distance. According to the period of the series, the sequence can be divided into invalid, valid, and forecast parts.
Figure 5: Ablation experiments on the input length of Trans+MA and the kernel size of the moving average (MA) module on the Exchange dataset. Trans+MA is an architecture similar to Autoformer, obtained by adding the MA module to the vanilla Transformer. (a) Increasing the input length of Trans+MA makes its performance worse. (b) Increasing the kernel size of MA improves the average performance of Trans+MA when the input length is fixed (set to 96 in this experiment). The numbers in the legend represent the forecast lengths, where the forecast errors are measured by MSE (dotted line) and MAE (solid line), respectively.
...and 13 more figures
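
Two operations named in the captions above, the moving average (MA) module with a configurable kernel size and aggregation along an explicit period, can be sketched in plain Python. This is an illustrative reading rather than the paper's code: the edge-replication padding follows the common Autoformer-style decomposition, the phase-wise mean is a simple stand-in for attention-based aggregation, and the names `moving_average`, `fold_by_period`, and `phase_mean` are hypothetical.

```python
def moving_average(series, kernel_size):
    """Smooth a series with a stride-1 mean filter, replicating the edges
    so the output has the same length as the input (assumed padding)."""
    if kernel_size < 1:
        raise ValueError("kernel_size must be at least 1")
    if not series:
        return []
    front = (kernel_size - 1) // 2
    back = kernel_size - 1 - front
    padded = [series[0]] * front + list(series) + [series[-1]] * back
    return [
        sum(padded[i:i + kernel_size]) / kernel_size
        for i in range(len(series))
    ]


def fold_by_period(series, period):
    """Reshape a series into rows of one period each, keeping the most
    recent complete periods. Each row is a short-term subseries; each
    column collects the same phase across periods (long-term subseries)."""
    if period < 1:
        raise ValueError("period must be at least 1")
    usable = len(series) - len(series) % period
    trimmed = list(series)[len(series) - usable:]
    return [trimmed[i:i + period] for i in range(0, usable, period)]


def phase_mean(series, period):
    """Average the values that share a phase; a plain mean used here as a
    stand-in for attention weights (assumption)."""
    rows = fold_by_period(series, period)
    if not rows:
        return []
    return [sum(row[j] for row in rows) / len(rows) for j in range(period)]
```

For instance, `moving_average(x, 25)` gives a smoothed trend whose residual `x[i] - trend[i]` can be treated as the seasonal part, and `phase_mean(x, 24)` summarizes an hourly series by hour of day. A larger kernel removes more noise at the cost of flattening short-term structure, which is the trade-off Figure 5(b) varies.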
