A Survey on Mamba Architecture for Vision Applications
Fady Ibrahim, Guangjun Liu, Guanghui Wang
TL;DR
This survey tackles the prohibitive quadratic complexity of attention in vision transformers by examining Mamba, a State Space Model-based architecture with near-linear scalability in sequence length. It analyzes ViM for images and VideoMamba for videos, detailing architectural innovations such as bidirectional scanning, selective state-space parameterization, and structure-aware fusion to capture local and global context efficiently. The paper comprehensively compares performance across image classification, semantic segmentation, and object detection, highlighting the trade-offs between accuracy and computational cost and identifying variants like Hi-Mamba, NC-SSD, and HSM-SSD as promising directions. Overall, Mamba-based vision backbones offer a compelling alternative to transformers for scalable, long-range visual understanding, with practical impact in high-resolution and video analytics and potential extensions to multi-modal tasks.
Abstract
Transformers have become foundational for visual tasks such as object detection, semantic segmentation, and video understanding, but the quadratic complexity of their attention mechanisms presents scalability challenges. To address this limitation, the Mamba architecture uses state-space models (SSMs), which scale linearly with sequence length, for efficient processing and improved contextual awareness. This paper surveys the Mamba architecture for vision applications and its recent advancements, including Vision Mamba (ViM) and VideoMamba, which introduce bidirectional scanning, selective scanning mechanisms, and spatiotemporal processing to enhance image and video understanding. Architectural innovations such as position embeddings, cross-scan modules, and hierarchical designs further optimize the Mamba framework for global and local feature extraction. These advancements position Mamba as a promising architecture for computer vision research and applications.
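To make the linear-scaling claim concrete, the following is a minimal sketch of a diagonal state-space recurrence applied to a flattened sequence of patch tokens, run once forward and once backward. It is an illustration under stated assumptions, not the implementation used by ViM or VideoMamba: the function names `ssm_scan` and `bidirectional_scan`, the parameter shapes, the fixed (input-independent) parameters `a`, `b`, `c`, the sharing of parameters between directions, and the summation of the two passes are all simplifications chosen here. Mamba's selective mechanism instead makes the parameters depend on the input.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Run a diagonal linear state-space recurrence over a token sequence.

    x: (L, D) array of L tokens with D channels.
    a, b, c: (D, N) arrays, one N-dimensional diagonal state per channel
             (hypothetical fixed parameters; a selective SSM would compute
             them from the input at every step).
    Recurrence: h_t = a * h_{t-1} + b * x_t,  y_t = sum_n c * h_t
    """
    L, D = x.shape
    h = np.zeros_like(a, dtype=float)   # hidden state, shape (D, N)
    y = np.empty((L, D), dtype=float)
    for t in range(L):                  # single pass: work grows linearly in L
        h = a * h + b * x[t][:, None]   # update the state with token t
        y[t] = (c * h).sum(axis=1)      # read out one value per channel
    return y

def bidirectional_scan(x, a, b, c):
    """Combine a forward and a backward scan so every token sees both sides.

    Assumption: both directions share parameters and outputs are summed.
    """
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(x[::-1], a, b, c)[::-1]  # scan reversed, then restore order
    return forward + backward
```

Each scan visits every token once and keeps a state of fixed size, so cost grows linearly with the number of tokens, whereas self-attention compares every token pair. The backward pass is what lets a token early in the flattened image receive information from patches that come later in scan order.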
