Unlocking the Power of Patch: Patch-Based MLP for Long-Term Time Series Forecasting

Peiwang Tang, Weitai Zhang

TL;DR

The paper questions the supremacy of Transformer models for long-term time series forecasting (LTSF), arguing that patch-based input and cross-variable interactions largely drive performance. It introduces PatchMLP, a concise, fully MLP-based architecture that combines Multi-Scale Patch Embedding, moving-average-based Feature Decomposition, and intra-/inter-variable MLPs with a dot-product coupling that enables information exchange across variables. Empirical evaluations on eight real-world datasets show PatchMLP achieving state-of-the-art performance on all 16 benchmarks, often outperforming Transformer baselines. The results underscore the importance of cross-variable interactions and patch-based representations for efficient, accurate LTSF, suggesting a shift toward simpler, more interpretable models that emphasize locality and inter-variable synergy.

Abstract

Recent studies have attempted to refine the Transformer architecture to demonstrate its effectiveness in Long-Term Time Series Forecasting (LTSF) tasks. Despite surpassing many linear forecasting models with ever-improving performance, we remain skeptical of Transformers as a solution for LTSF. We attribute the effectiveness of these models largely to the adopted Patch mechanism, which enhances sequence locality to an extent yet fails to fully address the loss of temporal information inherent to the permutation-invariant self-attention mechanism. Further investigation suggests that simple linear layers augmented with the Patch mechanism may outperform complex Transformer-based LTSF models. Moreover, diverging from models that use channel independence, our research underscores the importance of cross-variable interactions in enhancing the performance of multivariate time series forecasting. The interaction information between variables is highly valuable but has been misapplied in past studies, leading to suboptimal cross-variable models. Based on these insights, we propose a novel and simple Patch-based MLP (PatchMLP) for LTSF tasks. Specifically, we employ simple moving averages to extract smooth components and noise-containing residuals from time series data, engaging in semantic information interchange through channel mixing and specializing in random noise with channel independence processing. The PatchMLP model consistently achieves state-of-the-art results on several real-world datasets. We hope this surprising finding will spur new research directions in the LTSF field and pave the way for more efficient and concise solutions.
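The moving-average decomposition mentioned in the abstract can be illustrated with a minimal sketch. The function name `moving_average_decompose`, the `kernel_size` parameter, and the edge handling (replicating the first and last values) are assumptions made for illustration, and the sketch operates on a raw 1-D series rather than on embedded tokens; it is not the authors' implementation.

```python
def moving_average_decompose(series, kernel_size=5):
    """Split a 1-D series into a smooth component and a residual.

    Assumption of this sketch: edges are padded by repeating the first
    and last values so the smooth component keeps the input length.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    if not series:
        return [], []
    half = kernel_size // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    # Centered moving average over a window of kernel_size points.
    smooth = [
        sum(padded[i:i + kernel_size]) / kernel_size
        for i in range(len(series))
    ]
    # The residual holds whatever the moving average did not capture.
    residual = [x - s for x, s in zip(series, smooth)]
    return smooth, residual


series = [1.0, 2.0, 3.0, 4.0, 5.0]
smooth, residual = moving_average_decompose(series, kernel_size=3)
```

By construction, `smooth[i] + residual[i]` reproduces `series[i]`, so the two components can be processed separately (channel mixing for the smooth part, channel independence for the residual, as the abstract describes) and recombined afterwards.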

Paper Structure

This paper contains 14 sections, 1 equation, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: The self-attention scores of a 2-layer Transformer with different patch sizes trained on ETTh1. We follow the setup of PatchTST [nie2022time], retaining only the Encoder while replacing the Decoder with a simple MLP, and using a channel-independent approach. A patch size of 1 is equivalent to the original Transformer. The scores indicate that time series data often exhibits a tendency to be segmented into patches [zhang2022crossformer, tang2023infomaxformer], and increasing the patch size can mitigate this to some extent.
  • Figure 2: Experimental results of Patch Transformer on the ETTh2 dataset. (a) MSE for four forecast lengths when only the input length is varied and all other parameters are held constant. (b) MSE for a forecast length of 720 at five different input lengths when only the patch size is varied. (c) MSE for five different $d_{model}$ values, with input and forecast lengths both set to 720, when only the patch size is varied.
  • Figure 3: Overall structure of PatchMLP. First, the raw time series of different variables are independently processed through Multi-scale Patch Embedding. Then, Feature Decomposition uses moving averages to decompose the embedded tokens into smooth components and noisy residuals. Next, an MLP processes the sequences in two ways: intra-variable and inter-variable. Finally, the Predictor maps the latent vectors back to predictions and aggregates them into future series.
  • Figure 4: Overall structure of the MLP layer. The embedded vectors first interact with the temporal information within each variable through the Intra-Variable MLP. They then interact with the feature-domain information between variables through the Inter-Variable MLP. Subsequently, the result is multiplied by the input of the Inter-Variable MLP using a dot-product approach. Finally, it is added to the initial input of the MLP layer through a skip connection.
  • Figure 5: Forecasting performance (MSE) with varying look-back windows on 3 datasets: ETTh1, ETTm2, and Weather. The look-back windows are selected to be $L=\{192, 288, 384, 480, 576, 672, 768\}$, and the prediction horizons are $T = \{192, 720\}$.
  • ...and 1 more figures
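
The MLP layer described in the Figure 4 caption can be sketched roughly as follows. This is a hedged illustration rather than the authors' implementation: the names `mlp`, `init_mlp`, and `mlp_layer`, the two-layer MLP form, the GELU activation, the hidden sizes, and the reading of the "dot-product approach" as element-wise multiplication are all assumptions. The input `x` is taken to have shape `(n_vars, d_model)`, one embedded vector per variable.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU; the activation choice is an assumption.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp(x, w1, b1, w2, b2):
    # Two-layer MLP applied along the last axis of x.
    return gelu(x @ w1 + b1) @ w2 + b2


def init_mlp(rng, dim, hidden):
    # Hypothetical initializer: weights scaled by 1/sqrt(fan_in), zero biases.
    return (
        rng.normal(0.0, 1.0 / np.sqrt(dim), (dim, hidden)),
        np.zeros(hidden),
        rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, dim)),
        np.zeros(dim),
    )


def mlp_layer(x, intra_params, inter_params):
    """x has shape (n_vars, d_model)."""
    # Intra-variable MLP: each variable's embedding is processed on its own.
    h = mlp(x, *intra_params)
    # Inter-variable MLP: transpose so the MLP mixes across the variable axis.
    g = mlp(h.T, *inter_params).T
    # Couple the inter-variable output with its own input (assumed here to be
    # element-wise), then add the layer input back as a skip connection.
    return x + h * g


rng = np.random.default_rng(0)
n_vars, d_model = 7, 16
x = rng.normal(size=(n_vars, d_model))
intra_params = init_mlp(rng, d_model, 2 * d_model)
inter_params = init_mlp(rng, n_vars, 2 * n_vars)
y = mlp_layer(x, intra_params, inter_params)  # same shape as x
```

In this sketch the multiplicative coupling acts as a gate: if the inter-variable branch outputs zeros, the layer reduces to the identity through the skip connection, and any change in one variable can influence the others only through that branch.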