Mamba or Transformer for Time Series Forecasting? Mixture of Universals (MoU) Is All You Need

Sijia Peng; Yun Xiong; Yangyong Zhu; Zhiqiang Shen

TL;DR

MoU approaches time series forecasting by combining adaptive short-term patch embedding with a hierarchical long-term encoder. The MoF component is an adaptive, sparse mixture of feature extractors that captures the diversity of contexts within patches, while MoA stacks Mamba, FeedForward, Convolution, and Self-Attention layers in a fixed order to widen the temporal view progressively from partial to global. Across seven real-world datasets, MoU is reported to achieve state-of-the-art results at relatively low computational cost. The paper also provides ablations and model analyses to justify the design choices, and the code is publicly available for replication.

Abstract

Time series forecasting requires balancing short-term and long-term dependencies for accurate predictions. Existing methods mainly focus on long-term dependency modeling, neglecting the complexities of short-term dynamics, which may hinder performance. Transformers are superior in modeling long-term dependencies but are criticized for their quadratic computational cost. Mamba provides a near-linear alternative but is reported to be less effective in long-term time series forecasting due to potential information loss. Current architectures fall short of offering both high efficiency and strong performance for long-term dependency modeling. To address these challenges, we introduce Mixture of Universals (MoU), a versatile model that captures both short-term and long-term dependencies to enhance performance in time series forecasting. MoU is composed of two novel designs: Mixture of Feature Extractors (MoF), an adaptive method designed to improve time series patch representations for short-term dependency, and Mixture of Architectures (MoA), which hierarchically integrates Mamba, FeedForward, Convolution, and Self-Attention architectures in a specialized order to model long-term dependency from a hybrid perspective. The proposed approach achieves state-of-the-art performance while maintaining relatively low computational costs. Extensive experiments on seven real-world datasets demonstrate the superiority of MoU. Code is available at https://github.com/lunaaa95/mou/.
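
To make the MoF idea in the abstract more concrete, the following is a minimal NumPy sketch of a sparsely routed mixture of patch feature extractors. It is not the authors' implementation: the linear sub-extractors, the top-k routing rule, the softmax gating, and the names make_patches, sparse_mixture_embed, router_w, and expert_w are all assumptions made for illustration, as are the patch length, stride, and dimensions. The abstract only states that sub-extractors are adaptive and applied to time series patches.

```python
import numpy as np


def make_patches(series, patch_len, stride):
    """Split a 1-D series into (possibly overlapping) patches: (num_patches, patch_len)."""
    starts = range(0, len(series) - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len] for s in starts])


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def sparse_mixture_embed(patches, router_w, expert_w, top_k=2):
    """Embed patches with a router-weighted mixture of linear sub-extractors.

    Assumed shapes (hypothetical, for illustration only):
      patches:  (N, P)     N patches of length P
      router_w: (P, E)     linear router scoring E sub-extractors
      expert_w: (E, P, D)  one linear map per sub-extractor
    Returns embeddings (N, D) and the chosen sub-extractor indices (N, top_k).
    """
    logits = patches @ router_w                                  # (N, E)
    chosen = np.argsort(-logits, axis=1)[:, :top_k]              # best-scoring experts
    chosen_logits = np.take_along_axis(logits, chosen, axis=1)   # (N, top_k)
    gates = softmax(chosen_logits, axis=1)                       # weights over chosen only

    out = np.zeros((patches.shape[0], expert_w.shape[2]))
    for slot in range(top_k):
        weights = expert_w[chosen[:, slot]]                      # (N, P, D)
        projected = np.einsum("np,npd->nd", patches, weights)    # (N, D)
        out += gates[:, slot:slot + 1] * projected
    return out, chosen


rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 96)) + 0.1 * rng.standard_normal(96)
patches = make_patches(series, patch_len=16, stride=8)
router_w = rng.standard_normal((16, 4))
expert_w = rng.standard_normal((4, 16, 32))
emb, chosen = sparse_mixture_embed(patches, router_w, expert_w, top_k=2)
print(patches.shape, emb.shape, chosen.shape)
```

Only top_k of the sub-extractors contribute to each patch, so patches with different local shapes can be routed to different extractors. For clarity this sketch gathers a full weight matrix per patch; a practical implementation would instead group patches by chosen extractor to avoid that memory cost.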

Paper Structure

This paper contains 33 sections, 24 equations, 10 figures, and 6 tables.

Introduction
Approach
Problem Setting and Model Structure
Mixture of Feature Extractors
Mixture of Architectures
Computational Complexity and Model Parameter
Experiments
Datasets
Baselines and Setup
Main Results
Ablation Study
Ablation for feature extractor design.
Ablation for long-term encoders design.
Model Analysis
Does MoF actually learn contexts within patches?
...and 18 more sections

Figures (10)

Figure 1: Model efficiency comparison. Results are on ETTm2 with a forecasting length of 720, measured under a unified testing setup.
Figure 2: Illustration of different architectures for long-term time series forecasting. From left to right: PatchTST / Transformer [nie2022time], Mamba [gu2023mamba, wang2024mamba], ModernTCN [donghao2024moderntcn], Mambaformer [xu2024integrating], and our proposed MoU. The feed-forward layer is omitted for simplicity in the Transformer and in our model.
Figure 3: Illustration of the proposed Mixture of Feature Extractors (MoF) structure. MoF contains multiple Sub-Extractors, each tailored to learn different contexts within individual patches. The Router selectively activates Sub-Extractors in a sparse manner, ensuring both adaptivity and high efficiency.
Figure 4: Illustration of the proposed Mixture of Architectures (MoA) structure. MoA initially concentrates on the subset of dependencies selected by the SSM, then progressively expands the receptive field into a comprehensive global view.
Figure 5: Patches automatically categorized by their activated sub-extractors.
...and 5 more figures
