Mamba or Transformer for Time Series Forecasting? Mixture of Universals (MoU) Is All You Need

Sijia Peng, Yun Xiong, Yangyong Zhu, Zhiqiang Shen

TL;DR

MoU tackles time series forecasting by unifying adaptive short-term patch embedding with a hierarchical long-term encoder. The MoF component provides an adaptive, sparse mixture of feature extractors to capture patch-context diversity, while MoA stacks Mamba, feed-forward, convolution, and self-attention layers to progressively widen temporal awareness from partial to global views. Across seven real-world datasets, MoU achieves state-of-the-art results at substantially lower computational cost than pure Transformer approaches, improving both MSE and MAE over diverse baselines. The work introduces a principled, efficient framework for long-horizon forecasting and offers detailed ablations and analyses to justify the design choices, with publicly available code for replication.

Abstract

Time series forecasting requires balancing short-term and long-term dependencies for accurate predictions. Existing methods mainly focus on long-term dependency modeling and neglect the complexities of short-term dynamics, which may hinder performance. Transformers excel at modeling long-term dependencies but are criticized for their quadratic computational cost. Mamba provides a near-linear alternative but is reported to be less effective in long-term time series forecasting due to potential information loss. Current architectures fall short of offering both high efficiency and strong performance for long-term dependency modeling. To address these challenges, we introduce Mixture of Universals (MoU), a versatile model that captures both short-term and long-term dependencies to enhance performance in time series forecasting. MoU is composed of two novel designs: Mixture of Feature Extractors (MoF), an adaptive method designed to improve time series patch representations for short-term dependency, and Mixture of Architectures (MoA), which hierarchically integrates Mamba, FeedForward, Convolution, and Self-Attention architectures in a specialized order to model long-term dependency from a hybrid perspective. The proposed approach achieves state-of-the-art performance while maintaining relatively low computational costs. Extensive experiments on seven real-world datasets demonstrate the superiority of MoU. Code is available at https://github.com/lunaaa95/mou/.
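
The layer ordering described for MoA (Mamba, then FeedForward, then Convolution, then Self-Attention) can be illustrated with a toy, parameter-free sketch. This is not the authors' implementation: every function name below (selective_scan, feed_forward, local_conv, self_attention, moa_block) is hypothetical, the gated recurrence is only a simplified stand-in for a Mamba/SSM layer, and residual connections and normalization are omitted. The sketch only shows how a causal scan sees part of the sequence while the final attention step sees all of it.

```python
import math

def selective_scan(tokens):
    """Toy stand-in for a Mamba/SSM layer (assumption, not real Mamba):
    an input-dependent gated recurrence that runs in O(L) and is causal."""
    state = [0.0] * len(tokens[0])
    out = []
    for tok in tokens:
        # The gate depends on the current input: the "selective" part.
        gate = [1.0 / (1.0 + math.exp(-v)) for v in tok]
        state = [g * v + (1.0 - g) * s for g, v, s in zip(gate, tok, state)]
        out.append(state)
    return out

def feed_forward(tokens):
    """Pointwise nonlinearity standing in for a feed-forward layer."""
    return [[math.tanh(v) for v in tok] for tok in tokens]

def local_conv(tokens, kernel=3):
    """Local averaging over a small window: each token sees its neighbours."""
    half = kernel // 2
    out = []
    for i in range(len(tokens)):
        window = tokens[max(0, i - half): i + half + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

def self_attention(tokens):
    """Scaled dot-product attention with Q = K = V: O(L^2), global view."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * k[j] for w, k in zip(weights, tokens))
                    for j in range(d)])
    return out

def moa_block(tokens):
    """Apply the four stages in the order named in the abstract."""
    for layer in (selective_scan, feed_forward, local_conv, self_attention):
        tokens = layer(tokens)
    return tokens

if __name__ == "__main__":
    patches = [[0.1 * t, -0.1 * t] for t in range(6)]
    print(len(moa_block(patches)), "tokens out, each of dim", len(patches[0]))
```

In this sketch, perturbing the last token leaves the first output of selective_scan unchanged but does change the first output of moa_block, because only the attention stage lets every position see the whole sequence.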

Paper Structure (33 sections, 24 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Model efficiency comparison. Results are on ETTm2 with a forecasting length of 720, measured under a unified testing setup.
  • Figure 2: Illustration of different architectures for long-term time series forecasting. From left to right: PatchTST / Transformer [nie2022time], Mamba [gu2023mamba; wang2024mamba], ModernTCN [donghao2024moderntcn], Mambaformer [xu2024integrating], and our proposed MoU. The feed-forward layer is omitted for simplicity in the Transformer and in our model.
  • Figure 3: Illustration of the proposed Mixture of Feature Extractors (MoF) structure. MoF contains multiple Sub-Extractors, each tailored to learn different contexts within individual patches. The Router activates Sub-Extractors selectively and sparsely, which ensures both adaptivity and high efficiency.
  • Figure 4: Illustration of the proposed Mixture of Architectures (MoA) structure. MoA initially concentrates on the subset of dependencies selected by the SSM, then progressively expands the receptive field into a comprehensive global view.
  • Figure 5: Patches automatically categorized by their activated sub-extractors.
  • ...and 5 more figures
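
The Router and Sub-Extractor arrangement described in the Figure 3 caption can be sketched as a sparse mixture over patch embeddings. The sketch below is a minimal illustration under stated assumptions, not the paper's code: top-k routing with a softmax over the selected logits is an assumed gating scheme, each sub-extractor is assumed to be a single bias-free linear map, and the names SparseMoF, split_patches, num_extractors, and top_k are hypothetical.

```python
import math
import random

def linear(weights, x):
    """Bias-free dense layer; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def make_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def split_patches(series, patch_len, stride):
    """Cut a 1-D series into (possibly overlapping) patches."""
    return [series[i:i + patch_len]
            for i in range(0, len(series) - patch_len + 1, stride)]

class SparseMoF:
    """Hypothetical sparse mixture of linear sub-extractors for one patch."""

    def __init__(self, patch_len, d_model, num_extractors=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.d_model = d_model
        self.top_k = top_k
        # Router: one logit per sub-extractor, computed from the raw patch.
        self.router = make_matrix(num_extractors, patch_len, rng)
        # Sub-extractors: each maps a patch to a d_model-dim embedding.
        self.extractors = [make_matrix(d_model, patch_len, rng)
                           for _ in range(num_extractors)]

    def forward(self, patch):
        logits = linear(self.router, patch)
        # Sparse activation: only the top_k sub-extractors are evaluated.
        chosen = sorted(range(len(logits)), key=lambda i: logits[i],
                        reverse=True)[:self.top_k]
        gates = softmax([logits[i] for i in chosen])
        out = [0.0] * self.d_model
        for gate, idx in zip(gates, chosen):
            for j, v in enumerate(linear(self.extractors[idx], patch)):
                out[j] += gate * v
        return out, chosen

if __name__ == "__main__":
    series = [math.sin(0.3 * t) for t in range(32)]
    mof = SparseMoF(patch_len=8, d_model=16)
    for patch in split_patches(series, patch_len=8, stride=4):
        _, chosen = mof.forward(patch)
        print("activated sub-extractors:", chosen)
```

Because only top_k of the sub-extractors run for each patch, the per-patch cost stays close to that of a single extractor while different patches can still be routed to different ones, which is the adaptivity and efficiency trade-off the caption describes.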