MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Ali Hatamizadeh, Jan Kautz

TL;DR

MambaVision introduces a hybrid Mamba-Transformer vision backbone by redesigning the Mamba block for vision tasks and placing self-attention in the final layers. The four-stage hierarchical architecture uses CNN-based blocks in the early stages for efficiency and combines MambaVision mixer blocks with Transformer blocks in the later stages to recover global context. In experiments on ImageNet-1K and on downstream tasks such as COCO and ADE20K, the approach achieves a new Pareto frontier in Top-1 accuracy versus throughput and scales effectively to ImageNet-21K. Ablation studies validate design choices around token mixing, hybrid patterns, and attention windows, while interpretability analyses show that attention localizes on semantically meaningful regions. Overall, MambaVision demonstrates that a carefully balanced Mamba-Transformer hybrid can outperform pure Mamba or ViT backbones across vision benchmarks.

Abstract

We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones while demonstrating favorable performance. Code: https://github.com/NVlabs/MambaVision
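
The accuracy-versus-throughput claim can be made concrete with a small sketch of how a Pareto front is computed: a model is on the front if no other model is at least as good on both axes and strictly better on one. The `Point` type, the `pareto_front` helper, and the model names and numbers below are hypothetical placeholders for illustration, not measurements from the paper.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    throughput: float  # images/second, higher is better
    top1: float        # Top-1 accuracy in percent, higher is better


def pareto_front(points: list[Point]) -> list[Point]:
    """Return the points that are not dominated by any other point.

    A point is dominated if another point is at least as good on both
    axes and strictly better on at least one of them.
    """
    front = []
    for p in points:
        dominated = any(
            q.throughput >= p.throughput
            and q.top1 >= p.top1
            and (q.throughput > p.throughput or q.top1 > p.top1)
            for q in points
        )
        if not dominated:
            front.append(p)
    # Sort by throughput so the front reads left to right on a plot.
    return sorted(front, key=lambda p: p.throughput)


# Placeholder values for illustration only; NOT results from the paper.
models = [
    Point("model-a", 6000.0, 80.0),
    Point("model-b", 4000.0, 82.0),
    Point("model-c", 3500.0, 81.0),  # slower and less accurate than model-b
    Point("model-d", 2000.0, 84.0),
]
front_names = [p.name for p in pareto_front(models)]
```

With these placeholder values, `model-c` is dropped because `model-b` beats it on both axes; the remaining three models trade accuracy for speed and all stay on the front.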

Paper Structure (25 sections, 9 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Top-1 accuracy vs. image throughput comparisons on the ImageNet-1K dataset. The MambaVision models achieve a new Pareto front for the tradeoff between Top-1 accuracy and image throughput. Specifically, MambaVision variants outperform Mamba-based models such as VMamba and Vim, sometimes by a significant margin. For all models, image throughput is measured on an NVIDIA A100 GPU with a batch size of 128.
  • Figure 2: The architecture of hierarchical MambaVision models. The first two stages use residual convolutional blocks for fast feature extraction. Stages 3 and 4 employ both MambaVision and Transformer blocks. Specifically, given $N$ layers, we use $\frac{N}{2}$ MambaVision and MLP blocks, which are followed by additional $\frac{N}{2}$ Transformer and MLP blocks. The Transformer blocks in the final layers allow for recovering lost global context and capturing long-range spatial dependencies.
  • Figure 3: Architecture of the MambaVision block. In addition to replacing the causal Conv layers with their regular counterparts, we create a symmetric path without SSM as a token mixer to enhance the modeling of global context.
  • Figure 4: Performance scalability of MambaVision ImageNet-21K pretrained models with varying model sizes and resolutions.
  • Figure 5: Visualizations of MambaVision's self-attention layers showing how the model learns to focus on semantically meaningful regions via attention maps (middle) and overlays (right).
  • ...and 1 more figures
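
The block ordering described in the Figure 2 caption can be sketched as a plain list of block types per stage. This is a minimal illustration of the layout only, not the authors' implementation: the function names, the string labels, and the example depths are hypothetical, and the sketch accepts only even $N$ for the hybrid stages because the caption does not say how an odd $N$ would be split.

```python
def hybrid_stage_layout(num_layers: int) -> list[str]:
    """Block types for one hybrid stage (stage 3 or 4).

    Per the Figure 2 caption: the first N/2 layers are MambaVision
    blocks and the last N/2 are Transformer blocks, each paired with
    an MLP block.
    """
    # Assumption of this sketch: only even N is handled.
    if num_layers <= 0 or num_layers % 2 != 0:
        raise ValueError("this sketch expects a positive, even num_layers")
    half = num_layers // 2
    return ["mambavision+mlp"] * half + ["transformer+mlp"] * half


def model_layout(depths: list[int]) -> list[list[str]]:
    """Block types for all four stages of the hierarchy."""
    if len(depths) != 4:
        raise ValueError("expected one depth per stage (four stages)")
    stages = []
    for stage_index, depth in enumerate(depths):
        if stage_index < 2:
            # Stages 1 and 2: residual convolutional blocks.
            stages.append(["conv"] * depth)
        else:
            # Stages 3 and 4: MambaVision blocks, then Transformer blocks.
            stages.append(hybrid_stage_layout(depth))
    return stages


# Illustrative depths only; not taken from the paper.
layout = model_layout([2, 2, 6, 4])
```

Here `layout[2]` holds three MambaVision entries followed by three Transformer entries, so self-attention appears only in the final layers of each hybrid stage.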