SST: Multi-Scale Hybrid Mamba-Transformer Experts for Time Series Forecasting

Xiongxiao Xu, Canyu Chen, Yueqing Liang, Baixiang Huang, Guangji Bai, Liang Zhao, Kai Shu

TL;DR

The paper tackles the scalability of time-series forecasting by integrating memory-efficient State Space Models with expressive attention through a hybrid Mamba-Transformer. It exposes the failure of naive stacking and introduces a time-series decomposition to separate long-range patterns from short-range variations, enabling a multi-scale, MoE-style architecture named SST. SST employs a Mamba-based patterns expert for long-range dynamics, a Local Window Transformer for short-range variations, and a long-short router to adaptively fuse their outputs, with a patching scheme that yields varying resolutions. Empirical results on seven real-world datasets show state-of-the-art accuracy with linear complexity in sequence length, validating both the design principles and the practical efficiency of SST.

Abstract

Time series forecasting has made significant advances, including with Transformer-based models. The attention mechanism in Transformer effectively captures temporal dependencies by attending to all past inputs simultaneously. However, its quadratic complexity with respect to sequence length limits the scalability for long-range modeling. Recent state space models (SSMs) such as Mamba offer a promising alternative by achieving linear complexity without attention. Yet, Mamba compresses historical information into a fixed-size latent state, potentially causing information loss and limiting representational effectiveness. This raises a key research question: Can we design a hybrid Mamba-Transformer architecture that is both effective and efficient for time series forecasting? To address it, we adapt a hybrid Mamba-Transformer architecture Mambaformer, originally proposed for language modeling, to the time series domain. Preliminary experiments reveal that naively stacking Mamba and Transformer layers in Mambaformer is suboptimal for time series forecasting, due to an information interference problem. To mitigate this issue, we introduce a new time series decomposition strategy that separates time series into long-range patterns and short-range variations. Then we show that Mamba excels at capturing long-term structures, while Transformer is more effective at modeling short-term dynamics. Building on this insight, we propose State Space Transformer (SST), a multi-scale hybrid model with expert modules: a Mamba expert for long-range patterns and a Transformer expert for short-term variations. SST also employs a multi-scale patching mechanism to adaptively adjust time series resolution: low resolution for long-term patterns and high resolution for short-term variations. Experiments show that SST obtains SOTA performance with linear scalability. The code is at https://github.com/XiongxiaoXu/SST.
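The multi-scale patching and expert fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`make_patches`, `fuse`), the patch lengths and strides, the choice to feed the short-range expert only the most recent part of the window, and the fixed router scores are all assumptions made for illustration. In SST the router weights are learned, and the experts are a Mamba module and a Transformer module rather than the placeholder outputs used here.

```python
import math


def make_patches(series, patch_len, stride):
    """Split a 1-D series into (possibly overlapping) patches."""
    return [series[i:i + patch_len]
            for i in range(0, len(series) - patch_len + 1, stride)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(long_out, short_out, router_scores):
    """Weight two expert outputs by softmax-normalized router scores."""
    w_long, w_short = softmax(router_scores)
    return [w_long * a + w_short * b for a, b in zip(long_out, short_out)]


series = [float(t) for t in range(96)]

# Hypothetical settings: coarse patches (low resolution) over the full window
# for the long-range expert, fine patches (high resolution) over the most
# recent 48 steps for the short-range expert.
long_patches = make_patches(series, patch_len=48, stride=16)
short_patches = make_patches(series[-48:], patch_len=16, stride=8)

# Placeholder expert outputs and equal router scores, purely for illustration.
fused = fuse([1.0, 2.0], [3.0, 4.0], router_scores=[0.0, 0.0])
```

With these assumed settings, the long-range branch sees fewer, longer patches than the short-range branch, which is the sense in which the two experts operate at different resolutions.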

Paper Structure (25 sections, 6 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Comparison of the memory mechanisms of Mamba and attention, where $x_i$ denotes the input token at the $i$-th step. Top: Mamba, as an RNN-like mechanism, compresses previous tokens into a fixed-size state $h_{t-1}$, which serves as the memory. When the current token $x_t$ arrives, it is incorporated into $h_{t-1}$, producing a new memory $h_t$ of the same size. Because the size is fixed, the memory is inherently lossy, but its cost grows only linearly with sequence length. Bottom: Attention stores the keys $k$ and values $v$ of all previous tokens as memory. The memory is updated by appending the current token’s key and value, so it is lossless. Attention therefore handles short sequences effectively but may encounter computational difficulties with longer ones.
  • Figure 2: From bottom to top, as time series resolution shifts from fine-grained to coarse-grained, patterns become increasingly pronounced while variations diminish.
  • Figure 3: The architectures of the Mambaformer family. Positional encoding is optional across these variants. Mamba layers inherently encode positional information through their state dynamics, while Transformer layers require explicit positional encoding. When a Mamba layer precedes an attention layer (Mamba, Mamba-Attention, and Mambaformer), positional encoding can be omitted. However, if the attention layer comes first (Transformer and Attention-Mamba), positional encoding is necessary.
  • Figure 4: A real-world HPC dataset reflecting supercomputers' behaviors can be decomposed into patterns and variations by range. Long-term patterns (orange line) show repeated up-and-down trends because supercomputers execute applications intermittently, and short-term variations (green line) are extreme execution times caused by sudden network congestion.
  • Figure 5: Overview of SST. The multi-scale patcher transforms the input time series into different resolutions according to range. The Mamba expert is dedicated to long-term patterns and the LWT is responsible for short-term variations. The long-short router adaptively learns the contributions of the two experts.
  • ...and 3 more figures
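
The memory contrast described in the Figure 1 caption can be made concrete with a toy sketch. It is only an illustration under stated assumptions: the recurrence below is a plain linear update with a hypothetical `decay` factor, not Mamba's selective state update, and the key/value pairs are stand-ins rather than learned projections. The function names are hypothetical as well. The point is only how the memory size behaves as tokens arrive.

```python
def recurrent_memory_sizes(tokens, state_dim=4, decay=0.9):
    """Fold each token into a fixed-size state; record the state size per step."""
    state = [0.0] * state_dim
    sizes = []
    for x in tokens:
        # Toy linear recurrence h_t = decay * h_{t-1} + x_t
        # (an assumption for illustration, not Mamba's selective update).
        state = [decay * h + x for h in state]
        sizes.append(len(state))
    return sizes


def kv_cache_sizes(tokens):
    """Append one (key, value) pair per token; record the cache size per step."""
    cache = []
    sizes = []
    for x in tokens:
        cache.append((x, x))  # stand-in for the token's key and value
        sizes.append(len(cache))
    return sizes


tokens = [1.0, 2.0, 3.0, 4.0, 5.0]
fixed = recurrent_memory_sizes(tokens)
growing = kv_cache_sizes(tokens)
```

The recurrent state stays at `state_dim` entries regardless of how many tokens have been seen, while the key/value cache gains one entry per token. That matches the lossy-but-fixed-size versus lossless-but-growing trade-off the caption describes.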

Theorems & Definitions (1)

  • definition 1