Visual Mamba: A Survey and New Outlooks

Rui Xu, Shu Yang, Yihui Wang, Yu Cai, Bo Du, Hao Chen

TL;DR

Visual Mamba advances long-sequence vision modeling by combining structured state-space models with selective, input-dependent dynamics to achieve near-transformer capabilities at linear sequence-length cost. The survey details foundational Mamba formulations, diverse backbone designs, and broad modality applications across image, video, point clouds, and multi-modal data, while candidly discussing scalability, causality, and safety challenges. Key contributions include a decoupled taxonomy of scanning techniques, architectural variants (Mamba-1 and Mamba-2), and a comprehensive catalog of backbone designs and optimization strategies. The findings underscore Visual Mamba's strong potential as a visual foundation architecture, while highlighting areas for performance scaling, hardware optimization, and robust, interpretable deployment in real-world systems.

Abstract

Mamba, a recent selective structured state space model, excels in long sequence modeling, which is vital in the large model era. Long sequence modeling poses significant challenges, including capturing long-range dependencies within the data and handling the computational demands caused by extensive sequence lengths. Mamba addresses these challenges by overcoming the local perception limitations of convolutional neural networks and the quadratic computational complexity of Transformers. Given its advantages over these mainstream foundation architectures, Mamba exhibits great potential to be a visual foundation architecture. Since January 2024, Mamba has been actively applied to diverse computer vision tasks, yielding numerous contributions. To help keep pace with the rapid advancements, this paper reviews visual Mamba approaches, analyzing over 200 papers. This paper begins by delineating the formulation of the original Mamba model. Subsequently, it delves into representative backbone networks and applications categorized by modality, including image, video, point cloud, and multi-modal data. In particular, we identify scanning techniques as critical for adapting Mamba to vision tasks, and decouple these scanning techniques to clarify their functionality and enhance their flexibility across various applications. Finally, we discuss the challenges and future directions, providing insights into new outlooks in this fast-evolving area. A comprehensive list of visual Mamba models reviewed in this work is available at https://github.com/Ruixxxx/Awesome-Vision-Mamba-Models.

Paper Structure (48 sections, 9 equations, 7 figures, 9 tables)

This paper contains 48 sections, 9 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: The statistics of Mamba-based papers released to date on vision tasks, spanning different modalities including Image, Video, Point Cloud, and Multi-Modal.
  • Figure 2: Mamba-1 block: sequential generation of SSM parameters vs. Mamba-2 block: parallel generation of SSM parameters.
  • Figure 3: Scanning techniques, categorized into four groups, i.e., scan direction, scan axis, scan continuity, and scan sampling.
  • Figure 4: Visual Mamba blocks, including Vision Mamba (Vim) [icml24/vim], Visual State Space (VSS) [neurips24/vmamba], MSVMamba [neurips24/MSVMamba], PlainMamba [bmvc24/plainmamba], and LocalMamba [arxiv24/localmamba] blocks. The original Mamba [arxiv23/mamba] block is presented as a reference for the advancements in the visual Mamba blocks. The scanning techniques and their decoupling results are displayed to the left of and above the corresponding blocks, respectively.
  • Figure 5: Comparative analysis of the performance and computational complexity across various visual backbone architectures, encompassing Convolution-based methods [cvpr20/regnet, cvpr22/convnext, arxiv24/mambaout], Transformer-based methods [icml21/deit, iccv21/swinvit, eccv22/wavevit, neurips23/svt, pami23/volo], and Mamba-based methods [icml24/vim, bmvc24/plainmamba, neurips24/vmamba, neurips24/MSVMamba, arxiv24/Mamba-R, neurips24/QuadMamba, arxiv24/VSSD]. The symbol size is proportional to the parameter count of the respective model, providing a visual indicator of model scale and complexity. Note that all data are taken from released academic papers to ensure fairness and credibility.
  • ...and 2 more figures
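
The four scanning groups named in the Figure 3 caption (scan direction, scan axis, scan continuity, and scan sampling) can be read as independent choices about how a 2D patch grid is flattened into a 1D token sequence. The sketch below is only an illustration of that decoupled view, not the survey's implementation. The function name `scan_order` and its parameters are hypothetical, and modeling "scan sampling" as a fixed stride is a simplifying assumption.

```python
from typing import List


def scan_order(height: int, width: int, axis: str = "row",
               reverse: bool = False, continuous: bool = False,
               stride: int = 1) -> List[int]:
    """Return row-major patch ids in the order a scan would visit them.

    Hypothetical helper: each argument stands for one scanning group.
      axis       -> scan axis (traverse along rows or along columns)
      reverse    -> scan direction (forward or backward)
      continuous -> scan continuity (snake path instead of line jumps)
      stride     -> scan sampling (assumed here to be a fixed skip)
    """
    if axis == "row":
        lines = [[r * width + c for c in range(width)] for r in range(height)]
    elif axis == "col":
        lines = [[r * width + c for r in range(height)] for c in range(width)]
    else:
        raise ValueError("axis must be 'row' or 'col'")

    if continuous:
        # Reverse every other line so consecutive tokens stay spatially adjacent.
        lines = [line[::-1] if i % 2 else line for i, line in enumerate(lines)]

    order = [idx for line in lines for idx in line]
    if reverse:
        order = order[::-1]
    # Keep every `stride`-th token of the resulting sequence.
    return order[::stride]


if __name__ == "__main__":
    for kwargs in ({}, {"axis": "col"}, {"continuous": True},
                   {"reverse": True}, {"stride": 2}):
        print(kwargs, scan_order(3, 3, **kwargs))
```

Because the four options are separate arguments, any combination (for example a backward, column-wise snake scan) is produced by the same function, which is the practical benefit of treating the groups as decoupled.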