Mamba in Vision: A Comprehensive Survey of Techniques and Applications

Md Maklachur Rahman, Abdullah Aman Tutul, Ankur Nath, Lamyanba Laishram, Soon Ki Jung, Tracy Hammond

TL;DR

This survey analyzes the unique contributions, computational benefits, and applications of Mamba models in computer vision, identifies open challenges and future research directions, and serves as a foundational resource for understanding and advancing these models.

Abstract

Mamba is emerging as a novel approach to overcome the challenges faced by Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in computer vision. While CNNs excel at extracting local features, they often struggle to capture long-range dependencies without complex architectural modifications. In contrast, ViTs effectively model global relationships but suffer from high computational costs due to the quadratic complexity of their self-attention mechanisms. Mamba addresses these limitations by leveraging Selective Structured State Space Models to effectively capture long-range dependencies with linear computational complexity. This survey analyzes the unique contributions, computational benefits, and applications of Mamba models while also identifying challenges and potential future research directions. We provide a foundational resource for advancing the understanding and growth of Mamba models in computer vision. An overview of this work is available at https://github.com/maklachur/Mamba-in-Computer-Vision.
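The linear-versus-quadratic contrast in the abstract can be illustrated with a toy sketch. The code below is not the Mamba layer: it is a scalar, non-selective state space recurrence with hypothetical fixed parameters `a`, `b`, and `c`, whereas in a selective model such parameters would depend on the input. It only shows that a recurrent scan touches each of the L tokens once, while full self-attention scores L × L token pairs. The function names are illustrative assumptions.

```python
def ssm_scan(xs, a=0.5, b=1.0, c=1.0):
    """Toy discrete state space recurrence (assumed scalar, fixed parameters).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    One state update per token, so the cost grows linearly with len(xs).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # carry a summary of all earlier tokens in h
        ys.append(c * h)    # read the output from the current state
    return ys


def attention_pair_count(length):
    """Number of query-key scores in full self-attention over `length` tokens."""
    return length * length


# An impulse at the first token decays through the state, so later outputs
# still depend on it without any pairwise comparison between tokens.
impulse_response = ssm_scan([1.0, 0.0, 0.0, 0.0])
```

For a sequence of 16 patches, this scan performs 16 state updates, while `attention_pair_count(16)` gives 256 pairwise scores.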

Paper Structure

This paper contains 34 sections, 4 equations, 7 figures, and 20 tables.

Figures (7)

  • Figure 1: (a) Comparative trade-offs of CNN, Transformer, and Mamba frameworks. (b) Comparison of Top-1 accuracy for CNN, Transformer, and Mamba models on ImageNet-1K [hatamizadeh2024mambavision].
  • Figure 2: Overall taxonomy of Mamba models in computer vision tasks, categorized by their application areas. This taxonomy includes models from the baseline Mamba model [gu2023mamba] up to those published by July 15, 2024.
  • Figure 3: Overall pipeline of a Mamba-based vision model.
  • Figure 4: Various scanning patterns for image patches in Mamba, where the number on each patch gives its position in the scanning order. A: Sequential scan without continuity, B: Sequential zigzag scan, C: Diagonal scan without continuity, D: Diagonal zigzag scan, E: Spiral scan, F: Radial scan, and G: Hilbert curve scan.
  • Figure 5: Different scanning methods used in Mamba models for visual tasks.
  • ...and 2 more figures
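The scanning patterns named in the Figure 4 caption turn a 2D grid of patches into the 1D sequence a state space model consumes. The sketch below is one plausible reading of patterns A to D only. The function names, the starting corner, and the traversal directions are assumptions made for illustration, and the figure itself may use a different orientation.

```python
def raster_scan(h, w):
    """Pattern A (assumed): row by row, left to right, jumping back at row ends."""
    return [(r, c) for r in range(h) for c in range(w)]


def zigzag_scan(h, w):
    """Pattern B (assumed): row by row, reversing every other row for continuity."""
    order = []
    for r in range(h):
        cols = range(w) if r % 2 == 0 else range(w - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order


def diagonal_scan(h, w, zigzag=False):
    """Pattern C (assumed): anti-diagonals from the top-left corner.

    With zigzag=True, every other diagonal is reversed (pattern D, assumed),
    so consecutive patches always touch at an edge or a corner.
    """
    order = []
    for d in range(h + w - 1):
        diag = [(r, d - r) for r in range(h) if 0 <= d - r < w]
        if zigzag and d % 2 == 1:
            diag.reverse()
        order.extend(diag)
    return order


def flatten(grid, order):
    """Read patches from a 2D grid in the given (row, col) order."""
    return [grid[r][c] for r, c in order]


# A 3x3 grid whose entries are the raster indices of the patches.
grid = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
zigzag_sequence = flatten(grid, zigzag_scan(3, 3))
diagonal_sequence = flatten(grid, diagonal_scan(3, 3))
```

Every pattern visits each patch exactly once and differs only in the order, which determines which patches end up adjacent in the flattened sequence.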