PatchMixer: A Patch-Mixing Architecture for Long-Term Time Series Forecasting

Zeying Gong; Yujin Tang; Junwei Liang

TL;DR

PatchMixer introduces a permutation-variant convolutional structure to preserve temporal information and employs dual forecasting heads encompassing linear and nonlinear components to better model future curve trends and details.
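The dual-head idea can be illustrated with a minimal sketch. It assumes the two heads read the same flattened feature vector and that their outputs are summed; the summation, the GELU activation, the layer sizes, and the names `make_params` and `dual_head_forecast` are illustrative assumptions, not details confirmed by the text above.

```python
import math
import random


def linear(x, weight, bias):
    """Affine map; `weight` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(v):
    """Exact GELU activation (assumed nonlinearity for this sketch)."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def make_params(feature_dim, hidden_dim, horizon, seed=0):
    """Random parameters for both heads (hypothetical helper)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        # Linear head: one affine map from features to the horizon.
        "w_lin": matrix(horizon, feature_dim), "b_lin": [0.0] * horizon,
        # Nonlinear head: a two-layer MLP with a GELU in between.
        "w1": matrix(hidden_dim, feature_dim), "b1": [0.0] * hidden_dim,
        "w2": matrix(horizon, hidden_dim), "b2": [0.0] * horizon,
    }


def dual_head_forecast(features, params):
    """Sum of a linear head (coarse trend) and an MLP head (finer detail)."""
    linear_out = linear(features, params["w_lin"], params["b_lin"])
    hidden = [gelu(h) for h in linear(features, params["w1"], params["b1"])]
    mlp_out = linear(hidden, params["w2"], params["b2"])
    return [a + b for a, b in zip(linear_out, mlp_out)]


if __name__ == "__main__":
    params = make_params(feature_dim=8, hidden_dim=16, horizon=4)
    features = [0.1 * i for i in range(8)]
    print(dual_head_forecast(features, params))
```

With the MLP's output layer zeroed, the forecast reduces to the linear head alone, which makes the additive split between the two components easy to inspect.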

Abstract

Although the Transformer has been the dominant architecture for time series forecasting in recent years, a fundamental challenge remains: the permutation-invariant self-attention mechanism within Transformers leads to a loss of temporal information. To tackle this challenge, we propose PatchMixer, a novel CNN-based model. It introduces a permutation-variant convolutional structure to preserve temporal information. Diverging from conventional CNNs in this field, which often employ multiple scales or numerous branches, our method relies exclusively on depthwise separable convolutions. This allows us to extract both local features and global correlations using a single-scale architecture. Furthermore, we employ dual forecasting heads encompassing linear and nonlinear components to better model future curve trends and details. Our experimental results on seven time-series forecasting benchmarks indicate that, compared with the state-of-the-art method and the best-performing CNN, PatchMixer yields $3.9\%$ and $21.2\%$ relative improvements, respectively, while being 2-3x faster than the most advanced method.
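The contrast between permutation-invariant attention and permutation-variant convolution can be made concrete with a small sketch. It is a simplified illustration, not the model itself: the attention uses identity query/key/value projections, no positional encoding, and mean pooling, and the function names `attention_pool` and `conv1d_valid` are hypothetical.

```python
import math


def attention_pool(tokens):
    """Self-attention without positional encoding, then mean pooling.

    Queries, keys, and values are the tokens themselves (identity
    projections), which is an illustrative simplification.
    """
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        outputs.append(
            [sum(w / total * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        )
    return [sum(o[j] for o in outputs) / len(outputs) for j in range(d)]


def conv1d_valid(sequence, kernel):
    """1D cross-correlation over a scalar sequence with no padding."""
    k = len(kernel)
    return [
        sum(kernel[i] * sequence[t + i] for i in range(k))
        for t in range(len(sequence) - k + 1)
    ]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
    shuffled = [tokens[2], tokens[0], tokens[1]]
    # Same pooled result (up to rounding): the token order is not visible.
    print(attention_pool(tokens), attention_pool(shuffled))

    series = [1.0, 2.0, 4.0, 8.0]
    kernel = [1.0, -1.0]
    # Different results: the kernel responds to the order of the samples.
    print(conv1d_valid(series, kernel), conv1d_valid(series[::-1], kernel))
```

In practice Transformers add positional encodings to recover ordering; the sketch only shows that the attention operation on its own does not distinguish orderings, whereas a convolution does.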

Paper Structure

This paper contains 15 sections, 3 equations, 6 figures, and 5 tables.

Introduction
Related Work
The Patch-Mixing Design
The PatchMixer Model
Model Structure
PatchMixer Block
Dual Forecasting Heads
Instance Normalization
Experiments
Multivariate Long-term Forecasting
Ablation Study
Efficiency Analysis
Patch Embedding and Loss Optimization
Conclusion
Acknowledgements

Figures (6)

Figure 1: Channel Dependency vs. Patch Dependency analysis was conducted on the 3 largest datasets: Traffic, Electricity, and Weather. The left panel shows the normalized mutual information among different variables, revealing sparse correlations. The right panel illustrates a pronounced intra-variable dependency within single-variable temporal patches, indicated by the more intense coloration.
Figure 2: PatchMixer overview.
Figure 3: MSE scores with varying look-back windows on the 3 largest datasets. We report the top 5 methods for better observation. The look-back windows $L=[24,48,96,192,336,720]$, and the prediction horizons $T=[96, 720]$.
Figure 4: PatchMixer vs. PatchTST: Comparison of Efficiencies.
Figure 5: MSE scores with varying patch length on the 3 largest datasets. The patch length $P=[1,2,4,8,12,16,24,32,40]$, where $L=336$ and $T=96$. "NA" means the setup runs out of GPU memory (NVIDIA RTX 4090 24GB) even with batch size 1.
...and 1 more figure
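To give a sense of how the patch length in Figure 5 changes the number of patches, the following sketch splits a look-back window of $L=336$ into patches. The stride is not stated in the captions above, so the non-overlapping choice (stride equal to patch length) is an assumption, as are the helper names `num_patches` and `make_patches`.

```python
def num_patches(lookback, patch_len, stride):
    """Number of full patches that fit in a window of `lookback` steps."""
    if patch_len > lookback:
        raise ValueError("patch_len must not exceed lookback")
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return (lookback - patch_len) // stride + 1


def make_patches(series, patch_len, stride):
    """Slice a sequence into patches; trailing leftover steps are dropped."""
    count = num_patches(len(series), patch_len, stride)
    return [series[i * stride : i * stride + patch_len] for i in range(count)]


if __name__ == "__main__":
    lookback = 336
    for patch_len in [1, 2, 4, 8, 12, 16, 24, 32, 40]:
        # Assumption: non-overlapping patches (stride == patch_len).
        print(patch_len, num_patches(lookback, patch_len, stride=patch_len))
```

Under this assumption, very short patches produce hundreds of patches per series, which may be one reason the smallest settings are the most memory-hungry; an overlapping stride would raise the counts further.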
