MambaVision: A Hybrid Mamba-Transformer Vision Backbone
Ali Hatamizadeh, Jan Kautz
TL;DR
MambaVision introduces a hybrid Mamba-Transformer vision backbone by redesigning the Mamba block for vision tasks and placing self-attention in the final layers. The four-stage hierarchical architecture uses CNN-based blocks in the early stages for efficiency and combines the MambaVision mixer with Transformer blocks in the later stages to recover global context. In experiments on ImageNet-1K and on downstream tasks with COCO and ADE20K, the approach achieves a new Pareto frontier in Top-1 accuracy versus throughput, and it scales effectively to ImageNet-21K. Ablation studies validate the design choices around token mixing, hybrid patterns, and attention windows, while interpretability analyses show semantically meaningful attention localization. Overall, MambaVision demonstrates that a carefully balanced Mamba-Transformer hybrid can outperform pure Mamba or ViT backbones across vision benchmarks.
Abstract
We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution is a redesign of the Mamba formulation that enhances its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones. Code: https://github.com/NVlabs/MambaVision
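To make the layer arrangement concrete, the following minimal sketch lays out block types for a four-stage hybrid backbone, with convolutional blocks in the early stages and self-attention placed in the final layers of the later stages. It is an illustration only, not the authors' implementation: the function names, the block labels, the choice of which stages are convolutional, the 50% attention fraction, and the example depths are all assumptions introduced here.

```python
from typing import List, Sequence


def stage_layout(depth: int, attn_fraction: float = 0.5) -> List[str]:
    """Block types for one hybrid stage: mixer blocks first, self-attention last.

    attn_fraction is a hypothetical knob; the source only states that
    self-attention is placed in the final layers.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if not 0.0 <= attn_fraction <= 1.0:
        raise ValueError("attn_fraction must lie in [0, 1]")
    n_attn = int(depth * attn_fraction)  # attention blocks at the end of the stage
    n_mixer = depth - n_attn             # Mamba-style mixer blocks at the start
    return ["mamba_mixer"] * n_mixer + ["self_attention"] * n_attn


def backbone_layout(depths: Sequence[int], n_conv_stages: int = 2) -> List[List[str]]:
    """Per-stage block types for a four-stage hierarchical backbone.

    Assumption: the first n_conv_stages stages are purely convolutional
    and the remaining stages are hybrid.
    """
    if len(depths) != 4:
        raise ValueError("expected exactly four stages")
    layout = []
    for index, depth in enumerate(depths):
        if index < n_conv_stages:
            layout.append(["conv"] * depth)
        else:
            layout.append(stage_layout(depth))
    return layout


if __name__ == "__main__":
    # Hypothetical depths, chosen only for illustration.
    for index, blocks in enumerate(backbone_layout((1, 3, 8, 4)), start=1):
        print(f"stage {index}: {blocks}")
```

Keeping the layout as plain data separates the placement question (where attention goes) from the block implementations, which makes alternative hybrid patterns easy to enumerate when comparing them.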
