Visual Mamba: A Survey and New Outlooks
Rui Xu, Shu Yang, Yihui Wang, Yu Cai, Bo Du, Hao Chen
TL;DR
Visual Mamba targets long-sequence vision modeling by combining structured state-space models with selective, input-dependent dynamics, aiming for Transformer-like modeling capability at a cost that scales linearly with sequence length. The survey covers the foundational Mamba formulations, backbone designs, and applications across image, video, point cloud, and multi-modal data, and it discusses open challenges in scalability, causality, and safety. Key contributions include a decoupled taxonomy of scanning techniques, coverage of the architectural variants (Mamba-1 and Mamba-2), and a catalog of backbone designs and optimization strategies. The survey presents Visual Mamba as a promising candidate for a visual foundation architecture and identifies performance scaling, hardware optimization, and robust, interpretable real-world deployment as areas that need further work.
Abstract
Mamba, a recent selective structured state space model, excels in long sequence modeling, which is vital in the large model era. Long sequence modeling poses significant challenges, including capturing long-range dependencies within the data and handling the computational demands imposed by the length of the sequences. Mamba addresses these challenges by overcoming the local perception limitations of convolutional neural networks and the quadratic computational complexity of Transformers. Given its advantages over these mainstream foundation architectures, Mamba exhibits great potential to be a visual foundation architecture. Since January 2024, Mamba has been actively applied to diverse computer vision tasks, yielding numerous contributions. To help keep pace with these rapid advancements, this paper reviews visual Mamba approaches, analyzing over 200 papers. It begins by delineating the formulation of the original Mamba model. It then covers representative backbone networks and applications categorized by modality, including image, video, point cloud, and multi-modal data. In particular, we identify scanning techniques as critical for adapting Mamba to vision tasks, and decouple these scanning techniques to clarify their functionality and enhance their flexibility across various applications. Finally, we discuss the challenges and future directions, providing insights into new outlooks in this fast-evolving area. A comprehensive list of visual Mamba models reviewed in this work is available at https://github.com/Ruixxxx/Awesome-Vision-Mamba-Models.
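To make the role of scanning concrete, the following is a minimal sketch of how a 2D grid of patch tokens might be serialized into several 1D sequences before a sequence model processes them. It is an illustration only and is not taken from the survey: the function names (scan_orders, scan, unscan) and the choice of four directions (row-wise and column-wise, each forward and backward) are assumptions made for this example.

```python
def scan_orders(height, width):
    """Return four visiting orders over a height x width patch grid.

    Each order is a list of indices into the row-major flattened grid.
    The four directions are an illustrative assumption, not a fixed standard.
    """
    row_major = [r * width + c for r in range(height) for c in range(width)]
    col_major = [r * width + c for c in range(width) for r in range(height)]
    return {
        "row_forward": row_major,
        "row_backward": row_major[::-1],
        "col_forward": col_major,
        "col_backward": col_major[::-1],
    }


def scan(tokens, order):
    """Reorder row-major patch tokens into the sequence given by `order`."""
    return [tokens[i] for i in order]


def unscan(sequence, order):
    """Invert `scan`: put each sequence element back at its grid position."""
    restored = [None] * len(sequence)
    for position, index in enumerate(order):
        restored[index] = sequence[position]
    return restored


if __name__ == "__main__":
    # A 2 x 3 grid of hypothetical patch tokens, stored in row-major order.
    tokens = ["p0", "p1", "p2", "p3", "p4", "p5"]
    for name, order in scan_orders(2, 3).items():
        print(name, scan(tokens, order))
```

In a sketch like this, each scanned sequence would be passed through the sequence model separately, and the outputs would be mapped back to grid positions with unscan before being merged (for example by summation). Keeping the ordering step separate from the model is one way to read the idea of decoupling scanning from the rest of the architecture.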
