SEMixer: Semantics Enhanced MLP-Mixer for Multiscale Mixing and Long-term Time Series Forecasting

Xu Zhang, Qitong Wang, Peng Wang, Wei Wang

TL;DR

SEMixer is validated not only on 10 public datasets but also on the 2025 CCF AIOps Challenge, which is based on 21GB of real wireless network data and in which SEMixer achieves third place.

Abstract

Modeling multiscale patterns is crucial for long-term time series forecasting (TSF). However, redundancy and noise in time series, together with semantic gaps between non-adjacent scales, make it challenging to align and integrate multiscale temporal dependencies efficiently. To address this, we propose SEMixer, a lightweight multiscale model designed for long-term TSF. SEMixer features two key components: a Random Attention Mechanism (RAM) and a Multiscale Progressive Mixing Chain (MPMC). RAM captures diverse time-patch interactions during training and aggregates them via dropout ensemble at inference, enhancing patch-level semantics and enabling MLP-Mixer to better model multiscale dependencies. MPMC further stacks RAM and MLP-Mixer in a memory-efficient manner, achieving more effective temporal mixing. It addresses semantic gaps across scales and facilitates better multiscale modeling and forecasting performance. We validate the effectiveness of SEMixer not only on 10 public datasets but also on the 2025 CCF AIOps Challenge, which is based on 21GB of real wireless network data and in which SEMixer achieves third place. The code is available at https://github.com/Meteor-Stars/SEMixer.
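
The abstract describes RAM only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that RAM is dot-product attention over patch embeddings in which each patch-to-patch link is randomly disconnected with probability p during training, and that the "dropout ensemble" at inference corresponds to keeping all links, which equals the expected training-time weights under inverted-dropout rescaling. The function name `random_attention` and its parameters are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_attention(patches, p=0.2, training=True, rng=None):
    """Mix N patch embeddings (each a list of D floats) with attention.

    Assumption: during training each attention link is dropped with
    probability p and the kept links are rescaled by 1 / (1 - p), so the
    full attention used at inference is the expectation over all masks.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if rng is None:
        rng = random.Random()
    n, d = len(patches), len(patches[0])
    scale = math.sqrt(d)
    out = []
    for i in range(n):
        # Scaled dot-product scores between patch i and every patch j.
        scores = [
            sum(a * b for a, b in zip(patches[i], patches[j])) / scale
            for j in range(n)
        ]
        weights = softmax(scores)
        if training:
            # Randomly disconnect links; rescale the survivors.
            weights = [
                0.0 if rng.random() < p else w / (1.0 - p) for w in weights
            ]
        # Weighted sum of patch embeddings.
        out.append(
            [sum(weights[j] * patches[j][k] for j in range(n)) for k in range(d)]
        )
    return out


patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
train_out = random_attention(patches, p=0.3, training=True, rng=random.Random(0))
infer_out = random_attention(patches, training=False)
```

Under these assumptions, each training pass sees a different random subgraph of patch interactions, while inference is deterministic and uses all of them at once.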

Paper Structure (30 sections, 7 equations, 5 figures, 11 tables, 1 algorithm)

Figures (5)

  • Figure 1: Sub-figures (a)-(e) evaluate whether popular multiscale and single-scale models benefit from longer historical sequences, reporting MSE and MAE averaged over 96, 192, 336, and 720 steps. (f) gives an overall comparison on long-term forecasting. (g) shows training efficiency and memory overhead. SEMixer uses the same hyperparameters for all datasets and prediction lengths, adjusting only the input length, which demonstrates strong generalization and performance gains from longer sequences.
  • Figure 2: SEMixer components. The Multiscale Encoding Block processes the historical input $\mathcal{X}_h$ into $S$ multiscale inputs $\mathcal{X}_d^1, \mathcal{X}_d^2, \dots, \mathcal{X}_d^S$. The MPMC structure then progressively performs temporal mixing on the multiscale inputs, from finest to coarsest, to capture multiscale temporal dependencies. Each Temporal Mixing Block includes RAM (sub-figure b), Inter Patch Mixing ($\textit{Permute}_1$+$\textit{MLP}_1$), and Intra Patch Mixing ($\textit{Permute}_2$+$\textit{MLP}_2$). The multiscale outputs from all blocks are integrated for future forecasting.
  • Figure 3: Visualization of 12,000 learned patch embeddings of the 4 scale inputs $\mathcal{X}_p^1, \dots, \mathcal{X}_p^4$ on the Weather test set with prediction length 720.
  • Figure 4: Further ablation study of the MPMC structure, reporting average MSE across all prediction lengths (1020, 1320, and 1620) for sequence lengths of 2048 and 2560 on the long-term TSF task.
  • Figure 5: Hyperparameter analysis of the random disconnection probability $p$ on various datasets.
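
The Figure 2 caption names two mixing steps per Temporal Mixing Block, Inter Patch Mixing (Permute_1 + MLP_1) and Intra Patch Mixing (Permute_2 + MLP_2). The sketch below illustrates that pattern in the usual MLP-Mixer style for a single scale with N patches of D values each. It is an illustration under assumptions rather than the paper's code: the two-layer ReLU MLPs, the residual connections, the hidden width, and all function names are hypothetical, and the RAM step and the cross-scale chaining of MPMC are omitted.

```python
import random


def transpose(m):
    """Swap the two axes of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def linear(rows, weight):
    """Apply weight (in_dim x out_dim) to the last axis of rows."""
    return [
        [sum(r[i] * weight[i][j] for i in range(len(r))) for j in range(len(weight[0]))]
        for r in rows
    ]


def relu(rows):
    return [[max(0.0, v) for v in r] for r in rows]


def mlp(rows, w1, w2):
    """Two-layer MLP with ReLU (an assumed form of MLP_1 / MLP_2)."""
    return linear(relu(linear(rows, w1)), w2)


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mixing_block(x, inter_w, intra_w):
    """x has shape N patches x D values; the output has the same shape."""
    # Inter-patch mixing: Permute_1 moves the patch axis last (D x N),
    # MLP_1 mixes across patches, then the layout is permuted back.
    x = add(x, transpose(mlp(transpose(x), *inter_w)))  # residual is an assumption
    # Intra-patch mixing: with the layout back at N x D (Permute_2),
    # MLP_2 mixes the values inside each patch.
    return add(x, mlp(x, *intra_w))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_patches, patch_len, hidden = 4, 3, 8
x = [[rng.uniform(0.0, 1.0) for _ in range(patch_len)] for _ in range(n_patches)]
inter_w = (random_matrix(n_patches, hidden, rng), random_matrix(hidden, n_patches, rng))
intra_w = (random_matrix(patch_len, hidden, rng), random_matrix(hidden, patch_len, rng))
y = mixing_block(x, inter_w, intra_w)
```

Because both steps preserve the N x D shape, blocks of this kind can be stacked or applied to each scale in turn, which is the property a progressive chain over scales would rely on.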