A Survey on Mamba Architecture for Vision Applications

Fady Ibrahim, Guangjun Liu, Guanghui Wang

TL;DR

This survey tackles the prohibitive quadratic complexity of attention in vision transformers by examining Mamba, a State Space Model-based architecture with near-linear scalability in sequence length. It analyzes ViM for images and VideoMamba for videos, detailing architectural innovations such as bidirectional scanning, selective state-space parameterization, and structure-aware fusion to capture local and global context efficiently. The paper comprehensively compares performance across image classification, semantic segmentation, and object detection, highlighting the trade-offs between accuracy and computational cost and identifying variants like Hi-Mamba, NC-SSD, and HSM-SSD as promising directions. Overall, Mamba-based vision backbones offer a compelling alternative to transformers for scalable, long-range visual understanding, with practical impact in high-resolution and video analytics and potential extensions to multi-modal tasks.

Abstract

Transformers have become foundational for visual tasks such as object detection, semantic segmentation, and video understanding, but the quadratic complexity of their attention mechanisms limits scalability. To address this limitation, the Mamba architecture uses state-space models (SSMs) to achieve linear scaling in sequence length, efficient processing, and improved contextual awareness. This paper investigates the Mamba architecture for visual-domain applications and its recent advancements, including Vision Mamba (ViM) and VideoMamba, which introduce bidirectional scanning, selective scanning mechanisms, and spatiotemporal processing to enhance image and video understanding. Architectural innovations such as position embeddings, cross-scan modules, and hierarchical designs further adapt the Mamba framework for global and local feature extraction. These advancements position Mamba as a promising architecture for computer vision research and applications.
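To make the linear-scaling claim concrete, the following is a minimal NumPy sketch of a state-space recurrence scanned over a patch sequence in both directions. It is an illustration under stated assumptions, not the ViM or VideoMamba implementation: the parameters `a`, `b`, and `c` are fixed, hypothetical, diagonal weights (Mamba's selective mechanism would make them depend on the input), and summing the forward and backward outputs is just one simple way to merge the two scans.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Run a diagonal linear state-space recurrence over a sequence.

    x: (L, D) array of L tokens (e.g. image patches) with D channels.
    a, b, c: (D, N) decay, input, and output weights. These are
        hypothetical fixed parameters; a selective SSM would compute
        them from the input instead.
    Returns an (L, D) array. The loop visits each token once, so the
    cost is O(L * D * N), linear in the sequence length L.
    """
    L, D = x.shape
    h = np.zeros_like(a, dtype=float)      # hidden state, shape (D, N)
    y = np.empty((L, D))
    for t in range(L):
        h = a * h + b * x[t][:, None]      # h_t = a * h_{t-1} + b * x_t
        y[t] = (c * h).sum(axis=1)         # y_t = sum_n c_n * h_{t,n}
    return y

def bidirectional_scan(x, a, b, c):
    """Scan forward and backward, then sum (an assumed merge rule)."""
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(x[::-1], a, b, c)[::-1]   # reverse, scan, restore order
    return forward + backward

# Tiny example: one channel, one state dimension, an impulse at token 0.
x = np.array([[1.0], [0.0], [0.0]])
a = np.array([[0.5]])
b = np.array([[1.0]])
c = np.array([[1.0]])
print(ssm_scan(x, a, b, c)[:, 0])            # forward only: [1.0, 0.5, 0.25]
print(bidirectional_scan(x, a, b, c)[:, 0])  # both directions: [2.0, 0.5, 0.25]
```

In the forward-only scan, a token can only be influenced by tokens that precede it in the flattened patch order. The backward pass lets every token also receive context from the tokens that follow it, which is the motivation for bidirectional scanning on images, where no single 1D ordering is natural.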

Paper Structure

This paper contains 30 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Mamba Block Architecture with selective state space models.
  • Figure 2: (a) Single-direction scanning of patch embeddings, as in the original Mamba architecture. (b) Bidirectional selective scanning, introduced as a novel contribution in ViM.
  • Figure 3: 3D spatiotemporal bidirectional scanning. VideoMamba [li2025videomamba] introduces an enhanced scanning mechanism for 3D input data that combines spatial and temporal information.
  • Figure 4: Comparison of Mamba architectures for 2D image applications: (a) Vision Mamba; (b) VMamba; (c) LocalMamba, a set of multidirectional blocks that can be applied to other Mamba architectures; (d) PlainMamba; (e) SpatialMamba; (f) Famba-V, which introduces token fusion that can be applied to other Mamba architectures; (g) Hi-Mamba, used to extract high-resolution image features and for high-resolution image restoration.
  • Figure 5: (a) VideoMamba block and (b) VideoMambaPro architectures. The two are similar: (a) is the block from [zhu2024vision] adapted for 3D data, while (b) uses a residual SSM and masked patches in the backward direction to improve on (a).
  • ...and 1 more figure