MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection

Ali Behrouz, Michele Santacatterina, Ramin Zabih

TL;DR

MambaMixer introduces a dual-selective state-space framework that jointly selects informative tokens and channels (via S6 blocks) and links layers through weighted averaging of earlier features, enabling deep, scalable models for long-sequence data. The ViM2 and TSM2 variants demonstrate strong, cross-domain performance in vision and time-series forecasting, with competitive accuracy and substantial computational efficiency compared to Transformers and prior SSM-based models. Key contributions include the design of Selective Token/Channel Mixers, a hardware-friendly recurrence, and the demonstration that dual selection and early-feature access substantially improve stability and performance. The results suggest that selective cross-dimension mixing can provide a viable alternative to attention- and MLP-based backbones for long sequences and multi-dimensional data.
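To make "weighted averaging of earlier features" concrete, here is a minimal sketch of one way a layer could combine the outputs of all earlier blocks. The function names and the softmax normalization of the weights are assumptions made for illustration; this summary does not say how the paper parameterizes or normalizes the weights.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def combine_earlier_features(features, logits):
    """Weighted average of earlier block outputs (hypothetical helper).

    features: list of k arrays, each of shape (L, D), one per earlier block.
    logits:   k learnable scalars; softmax turns them into averaging weights
              (the normalization is an assumption, not taken from the paper).
    """
    weights = softmax(np.asarray(logits, dtype=float))  # (k,), sums to 1
    stacked = np.stack(features)                        # (k, L, D)
    return np.tensordot(weights, stacked, axes=1)       # (L, D)

# Three earlier outputs with constant values 1, 2 and 3.
feats = [np.full((2, 3), v) for v in (1.0, 2.0, 3.0)]
mixed = combine_earlier_features(feats, [0.0, 0.0, 0.0])  # equal weights
```

With equal logits the result is the plain mean of the earlier outputs; a large logit for one block makes the layer read almost exclusively from that block, which is the sense in which each layer has direct access to early features.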

Abstract

Recent advances in deep learning have mainly relied on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, has quadratic time and space complexity in the input size, limiting their scalability for long-sequence modeling. Despite recent attempts to design efficient and effective architecture backbones for multi-dimensional data, such as images and multivariate time series, existing models are either data independent or fail to allow inter- and intra-dimension communication. Recently, State Space Models (SSMs), and more specifically Selective State Space Models, with efficient hardware-aware implementation, have shown promising potential for long sequence modeling. Motivated by the success of SSMs, we present MambaMixer, a new architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels, called Selective Token and Channel Mixer. MambaMixer connects selective mixers using a weighted averaging mechanism, allowing layers to have direct access to early features. As a proof of concept, we design Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block and explore their performance in various vision and time series forecasting tasks. Our results underline the importance of selective mixing across both tokens and channels. In ImageNet classification, object detection, and semantic segmentation tasks, ViM2 achieves competitive performance with well-established vision models and outperforms SSM-based vision models. In time series forecasting, TSM2 achieves outstanding performance compared to state-of-the-art methods while demonstrating significantly improved computational cost. These results show that while Transformers, cross-channel attention, and MLPs are sufficient for good performance in time series forecasting, none of them is necessary.
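The dual selection idea can be illustrated with a toy selective scan applied first along the token axis and then along the channel axis. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `selective_scan`, `dual_mixer`, and `make_params`, the simplified discretization, the unidirectional scan, and the plain residual connections are all illustrative choices, and the hardware-aware parallel scan is replaced by a simple loop.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def selective_scan(x, W_delta, W_B, W_C, A):
    """Toy selective SSM scanned over the first axis of x.

    x: (L, D) sequence with L steps and D features.
    A: (D, N) negative diagonal state matrix, so the state decays.
    The step size delta and the B, C projections depend on the input x[t],
    which is what makes the recurrence "selective".
    """
    L, D = x.shape
    h = np.zeros_like(A)                      # hidden state, (D, N)
    y = np.zeros((L, D))
    for t in range(L):
        delta = softplus(x[t] @ W_delta)      # (D,) input-dependent step size
        B = x[t] @ W_B                        # (N,) input-dependent input map
        C = x[t] @ W_C                        # (N,) input-dependent output map
        A_bar = np.exp(delta[:, None] * A)    # (D, N) discretized decay in (0, 1)
        B_bar = delta[:, None] * B[None, :]   # (D, N) simplified discretization
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = h @ C                          # (D,)
    return y

def make_params(dim, n_state, rng):
    """Random parameters for a scan whose feature size is `dim` (hypothetical)."""
    W_delta = 0.1 * rng.normal(size=(dim, dim))
    W_B = 0.1 * rng.normal(size=(dim, n_state))
    W_C = 0.1 * rng.normal(size=(dim, n_state))
    A = -np.tile(np.arange(1.0, n_state + 1.0), (dim, 1))
    return W_delta, W_B, W_C, A

def dual_mixer(x, token_params, channel_params):
    """Mix along tokens, then along channels, each with a residual connection."""
    y = x + selective_scan(x, *token_params)          # scan over the L tokens
    z = y + selective_scan(y.T, *channel_params).T    # scan over the D channels
    return z

rng = np.random.default_rng(0)
L, D, N = 6, 4, 3
x = rng.normal(size=(L, D))
token_params = make_params(D, N, rng)     # features are channels
channel_params = make_params(L, N, rng)   # features are tokens
out = dual_mixer(x, token_params, channel_params)
```

Both scans cost time linear in the length of the axis they traverse, in contrast to the quadratic cost of attention mentioned above. Transposing the input before the second scan is what lets the same recurrence select across channels rather than tokens.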

Paper Structure (23 sections, 10 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Architecture design of MambaMixer. For further potential architectures, see the appendix on architectures.
  • Figure 2: Architecture design and overview of the ViM2's pipeline.
  • Figure 3: Architecture design and overview of the TSM2's pipeline.
  • Figure 4: A comparison of input scaling evaluation for ViM2 and baselines. All models are trained with 224 × 224 inputs.
  • Figure 5: Effect of the number of parameters on performance: ViM2 compared with baselines on ImageNet-1K.
  • ...and 3 more figures