Minusformer: Improving Time Series Forecasting by Progressively Learning Residuals

Daojun Liang, Haixia Zhang, Dongfeng Yuan, Bingzheng Zhang, Minggao Zhang

TL;DR

Minusformer tackles pervasive overfitting in time series forecasting by reorienting Transformer aggregations from addition to subtraction and by adding a highway of auxiliary outputs that progressively learn residual supervision signals. Framed as a deep-ensemble, Boosting-like process, the architecture uses dual streams and gate-controlled blocks to decompose inputs and labels, yielding variance reduction and improved generalization. The authors provide theoretical variance bounds and validate the approach across diverse real-world datasets, reporting average gains of about 11.9% over strong baselines and robust performance on Monash TS benchmarks. The method is shown to be adaptable to alternative Attention mechanisms, interpretable through block-wise visualization, and scalable to deeper architectures with stable performance, underscoring its practical impact for non-stationary TS forecasting.

Abstract

In this paper, we find that ubiquitous time series (TS) forecasting models are prone to severe overfitting. To cope with this problem, we embrace a de-redundancy approach to progressively reinstate the intrinsic values of TS for future intervals. Specifically, we introduce a dual-stream and subtraction mechanism, which is a deep Boosting ensemble learning method. The vanilla Transformer is renovated by reorienting the information aggregation mechanism from addition to subtraction. Then, we incorporate an auxiliary output branch into each block of the original model to construct a highway leading to the ultimate prediction. The output of subsequent modules in this branch subtracts the previously learned results, enabling the model to learn the residuals of the supervision signal layer by layer. This design facilitates the learning-driven implicit progressive decomposition of the input and output streams, empowering the model with heightened versatility, interpretability, and resilience against overfitting. Since all aggregations in the model are minus signs, the model is called Minusformer. Extensive experiments demonstrate that the proposed method outperforms existing state-of-the-art methods, yielding an average performance improvement of 11.9% across various datasets. The code has been released at https://github.com/Anoise/Minusformer.
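
The idea of learning the residuals of the supervision signal layer by layer can be illustrated with a toy, Boosting-style stagewise fit. This is only a minimal sketch under stated assumptions, not the Minusformer architecture: the "blocks" here are hypothetical one-coefficient least-squares learners on fixed basis functions, and the names `fit_scale`, `stagewise_residual_fit`, and `predict` are invented for illustration. Each stage fits what the earlier stages left unexplained, and the explained part is removed by subtraction.

```python
# Toy stagewise residual fitting (illustrative assumption, not the paper's model).

def fit_scale(feature, target):
    """Return the least-squares coefficient a minimizing sum((target - a * feature)^2)."""
    denom = sum(f * f for f in feature)
    if denom == 0.0:
        return 0.0  # a zero feature cannot explain anything
    return sum(f * t for f, t in zip(feature, target)) / denom


def stagewise_residual_fit(xs, ys, bases):
    """Fit one coefficient per basis, each stage on the residual left by earlier stages."""
    residual = list(ys)
    coeffs = []
    sse_history = [sum(r * r for r in residual)]  # squared error before any stage
    for basis in bases:
        feature = [basis(x) for x in xs]
        a = fit_scale(feature, residual)
        coeffs.append(a)
        # Subtract what this stage learned; the next stage sees only the remainder.
        residual = [r - a * f for r, f in zip(residual, feature)]
        sse_history.append(sum(r * r for r in residual))
    return coeffs, sse_history


def predict(x, coeffs, bases):
    """The final prediction is the sum of all stage outputs."""
    return sum(a * b(x) for a, b in zip(coeffs, bases))


# Hypothetical toy data, roughly linear in x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]
bases = [lambda x: 1.0, lambda x: x, lambda x: x * x]

coeffs, sse_history = stagewise_residual_fit(xs, ys, bases)
print("stage coefficients:", coeffs)
print("squared error after each stage:", sse_history)
```

Because each stage may always choose a zero coefficient, the training squared error never increases from one stage to the next in this toy setting. The sketch shows only the mechanics of subtraction-based residual fitting; it does not reproduce the paper's generalization results.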

Paper Structure (34 sections, 1 theorem, 13 equations, 13 figures, 12 tables, 1 algorithm)

Key Result

Theorem 1

Without loss of generality, assume that the estimation error of block $f_l(X)$ is $e_l$, with $e_l \overset{i.i.d.}{\sim} \mathcal{N}(0, \nu)$. Let $\alpha_l = \alpha$ be the weight of $f_l$, $l \in [0, L]$, and denote the covariance of the estimates of two different blocks by $\mu$. Then we have
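
As background for reading these assumptions, and not as a reconstruction of the theorem's conclusion, the standard variance identity for an equally weighted sum of $L+1$ estimators is worked out here. It assumes that $\nu$ denotes the per-block error variance and that $\mu$ is the common covariance between the errors of any two distinct blocks. Whether the theorem combines the blocks with exactly these signs and weights is not confirmed by the text above.

$$
\operatorname{Var}\!\left(\sum_{l=0}^{L} \alpha\, e_l\right)
= \alpha^2 \left[\sum_{l=0}^{L} \operatorname{Var}(e_l) + \sum_{l \neq m} \operatorname{Cov}(e_l, e_m)\right]
= \alpha^2 \left[(L+1)\,\nu + L(L+1)\,\mu\right].
$$

For the averaging weight $\alpha = \frac{1}{L+1}$, this becomes

$$
\operatorname{Var}\!\left(\frac{1}{L+1}\sum_{l=0}^{L} e_l\right) = \frac{\nu}{L+1} + \frac{L}{L+1}\,\mu ,
$$

so under these assumptions the variance contributed by individual blocks shrinks as $L$ grows, while the part driven by the covariance between blocks does not.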

Figures (13)

  • Figure 1: Comparison of the proposed Minusformer with other recent advanced models. The results (MSE) are averaged across all prediction lengths. The numerical suffix after each model name indicates the input length of the model. Minusformer is configured with two input lengths in order to align with the other models.
  • Figure 2: Generalization of the model when time series are aggregated in different directions. The experiment was conducted utilizing Transformer with 4 blocks (baseline) on the Traffic dataset.
  • Figure 3: (a) Deep ensemble learning is equivalent to meta-algorithmic Boosting. (b) The relationship between model bias, variance, and loss.
  • Figure 4: The architecture of Minusformer.
  • Figure 5: Ablation studies on various components of Minusformer. All results are averaged across all prediction lengths. The variables $X$ and $Y$ represent the input and output streams, while the signs '+' and '-' denote the addition or subtraction operation used when aggregating the streams. The letter 'G' denotes adding a gating mechanism to the output of each block.
  • ...and 8 more figures

Theorems & Definitions (2)

  • Theorem 1
  • proof