Mamba YOLO: A Simple Baseline for Object Detection with State Space Model

Zeyu Wang, Chen Li, Huiying Xu, Xinzhong Zhu, Hongbo Li

TL;DR

This work addresses the quadratic complexity of Transformer-based self-attention in real-time object detection by introducing Mamba YOLO, a simple, pretraining-free baseline built on a State Space Model with linear complexity. The architecture combines the ODMamba backbone with the ODSSBlock, which decouples global spatial processing (SS2D) from channel-wise fusion (RG Block), and a Vision Clue Merge to preserve spatial cues; multi-scale features are fused via a PAFPN-like neck to feed a Decoupled Head. Key contributions include the ODSSBlock design, the Residual Gated (RG) Block, and the Vision Clue Merge, with extensive COCO experiments showing state-of-the-art speed–accuracy trade-offs and a tiny variant achieving 7.5% mAP improvement at 1.5 ms latency on a 4090 GPU. The approach delivers a practical, high-performance baseline for real-time YOLO-style detection, reducing reliance on large-scale pretraining while maintaining competitive accuracy and efficiency.

Abstract

Driven by the rapid development of deep learning technology, the YOLO series has set a new benchmark for real-time object detectors. Additionally, Transformer-based structures have emerged as the most powerful solution in the field, greatly extending the model's receptive field and achieving significant performance improvements. However, this improvement comes at a cost, as the quadratic complexity of the self-attention mechanism increases the computational burden of the model. To address this problem, we introduce a simple yet effective baseline approach called Mamba YOLO. Our contributions are as follows: 1) We propose the ODMamba backbone, which introduces a State Space Model (SSM) with linear complexity to address the quadratic complexity of self-attention. Unlike other Transformer-based and SSM-based methods, ODMamba is simple to train and requires no pretraining. 2) To meet real-time requirements, we design the macro structure of ODMamba and determine the optimal stage ratio and scaling size. 3) We design the RG Block, which employs a multi-branch structure to model the channel dimensions and addresses possible limitations of SSMs in sequence modeling, such as insufficient receptive fields and weak image localization. This design captures localized image dependencies significantly more accurately. Extensive experiments on the publicly available COCO benchmark dataset show that Mamba YOLO achieves state-of-the-art performance compared to previous methods. Specifically, a tiny version of Mamba YOLO achieves a 7.5% improvement in mAP on a single 4090 GPU with an inference time of 1.5 ms. The PyTorch code is available at: https://github.com/HZAI-ZJNU/Mamba-YOLO
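
To make the linear-complexity claim concrete, the following is a minimal sketch of a discretized state space recurrence, h_t = a_bar * h_{t-1} + b_bar * x_t and y_t = c . h_t, run as a single pass over a 1-D sequence. It is not the ODMamba or SS2D implementation: it assumes a diagonal, time-invariant state matrix, whereas the selective SSM used in Mamba makes its parameters depend on the input. The function name `ssm_scan` and the parameter names `a_bar`, `b_bar`, and `c` are illustrative and do not come from the paper or its repository.

```python
def ssm_scan(x, a_bar, b_bar, c):
    """Run a diagonal linear state space recurrence over a 1-D sequence.

    x      : list of scalar inputs, one per sequence position
    a_bar  : per-state decay factors (diagonal of the discretized state matrix)
    b_bar  : per-state input weights
    c      : per-state output weights
    Assumption: parameters are fixed across positions (not selective).
    """
    state = [0.0] * len(a_bar)  # hidden state h_0 starts at zero
    outputs = []
    for x_t in x:  # one update per position: work grows linearly with len(x)
        state = [a * h + b * x_t for a, h, b in zip(a_bar, state, b_bar)]
        outputs.append(sum(ci * hi for ci, hi in zip(c, state)))
    return outputs


# Impulse response of a single-state system with decay 0.5.
print(ssm_scan([1.0, 0.0, 0.0], a_bar=[0.5], b_bar=[1.0], c=[2.0]))
# [2.0, 1.0, 0.5]
```

Each position touches only the fixed-size hidden state, so the cost per position is constant and the total cost is proportional to the sequence length. Self-attention, by contrast, compares every position with every other position.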

Paper Structure (21 sections, 16 equations, 6 figures, 4 tables)

This paper contains 21 sections, 16 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Comparisons of the real-time object detectors on the MSCOCO dataset. The object detection method based on SSM achieves the best trade-off between performance and computation.
  • Figure 2: Illustration of the Mamba YOLO architecture. Mamba YOLO builds its backbone from ODSSBlocks with selective SSM, uses a Simple Stem to split the input image into patches, and uses Vision Clue Merge for downsampling. Multi-level features $\{C3,C4,C5\}$ are extracted from the backbone and fed into the PAFPN, where the ODSSBlock refines and fuses high-level semantic features with low-level spatial features. The resulting $\{P3,P4,P5\}$ features are passed to the Decoupled Head, which outputs the detection results.
  • Figure 3: Illustration of the ODSSBlock architecture.
  • Figure 4: Comparison between DINO-R50 and Mamba YOLO-L in terms of GPU memory efficiency and mAP. DINO requires a higher input resolution to maintain a high mAP, and its GPU memory and FLOPs grow quadratically as the resolution increases. In contrast, Mamba YOLO's GPU memory requirements grow linearly, and it achieves its highest performance at a smaller resolution of 640×640, with fewer FLOPs and faster inference.
  • Figure 5: Inference results for each detector on the COCO dataset. Detailed objects have been enlarged for better illustration.
  • ...and 1 more figures
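
The scaling contrast described in the Figure 4 caption can be illustrated with a short back-of-the-envelope count. This is a sketch under stated assumptions, not a reproduction of the paper's measurements: it assumes a hypothetical feature-map stride of 32, counts pairwise token interactions for self-attention, and counts one state update per token for a sequential scan. The function names are illustrative only.

```python
def token_count(height, width, stride):
    """Number of feature-map positions for an input image at a given stride."""
    return (height // stride) * (width // stride)


def attention_pair_count(n_tokens):
    """Self-attention compares every token with every token: n^2 pairs."""
    return n_tokens * n_tokens


def scan_step_count(n_tokens):
    """A sequential state space scan performs one state update per token."""
    return n_tokens


for side in (640, 1280):
    n = token_count(side, side, stride=32)  # stride 32 is an assumed value
    print(side, n, attention_pair_count(n), scan_step_count(n))
# 640 400 160000 400
# 1280 1600 2560000 1600
```

Under these assumptions, doubling the image side multiplies the token count by 4, the scan steps by 4, and the attention pairs by 16.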