Mamba YOLO: A Simple Baseline for Object Detection with State Space Model
Zeyu Wang, Chen Li, Huiying Xu, Xinzhong Zhu, Hongbo Li
TL;DR
This work addresses the quadratic complexity of Transformer-based self-attention in real-time object detection by introducing Mamba YOLO, a simple, pretraining-free baseline built on a State Space Model (SSM) with linear complexity. The architecture pairs the ODMamba backbone with the ODSSBlock, which decouples global spatial processing (SS2D) from channel-wise fusion (RG Block), and uses a Vision Clue Merge to preserve spatial cues; multi-scale features are fused by a PAFPN-like neck that feeds a Decoupled Head. Key contributions include the ODSSBlock design, the Residual Gated (RG) Block, and the Vision Clue Merge. Extensive COCO experiments show state-of-the-art speed–accuracy trade-offs, with a tiny variant achieving a 7.5% mAP improvement at 1.5 ms inference latency on a single 4090 GPU. The approach delivers a practical, high-performance baseline for real-time YOLO-style detection, reducing reliance on large-scale pretraining while maintaining competitive accuracy and efficiency.
Abstract
Driven by the rapid development of deep learning technology, the YOLO series has set a new benchmark for real-time object detectors. In addition, Transformer-based structures have emerged as the most powerful solution in the field, greatly extending the model's receptive field and achieving significant performance improvements. However, this improvement comes at a cost, as the quadratic complexity of the self-attention mechanism increases the computational burden of the model. To address this problem, we introduce a simple yet effective baseline approach called Mamba YOLO. Our contributions are as follows: 1) We propose the ODMamba backbone, which introduces a State Space Model (SSM) with linear complexity to address the quadratic complexity of self-attention. Unlike other Transformer-based and SSM-based methods, ODMamba is simple to train and requires no pretraining. 2) To meet real-time requirements, we design the macro structure of ODMamba and determine the optimal stage ratio and scaling size. 3) We design the RG Block, which employs a multi-branch structure to model the channel dimensions and addresses possible limitations of SSMs in sequence modeling, such as insufficient receptive fields and weak image localization. This design captures local image dependencies more accurately. Extensive experiments on the publicly available COCO benchmark dataset show that Mamba YOLO achieves state-of-the-art performance compared to previous methods. Specifically, a tiny version of Mamba YOLO achieves a 7.5% improvement in mAP on a single 4090 GPU with an inference time of 1.5 ms. The PyTorch code is available at: https://github.com/HZAI-ZJNU/Mamba-YOLO
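The contrast between linear and quadratic complexity can be made concrete with a toy example. The following is a minimal sketch, assuming a scalar, fixed-parameter state space recurrence; it is not the ODMamba or SS2D implementation, and the names `ssm_scan`, `pairwise_interactions`, `a`, `b`, and `c` are hypothetical, chosen only for illustration. It shows that a state space recurrence visits each sequence element once, whereas a full self-attention map scores every pair of elements.

```python
# Illustrative sketch only: a scalar, fixed-parameter state space recurrence.
# This is NOT the paper's ODMamba / SS2D code; all names here are hypothetical.


def ssm_scan(xs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over a sequence.

    The loop makes one pass with constant work per step, so the cost
    grows linearly with the sequence length L.
    """
    h = 0.0  # hidden state, initialised to zero
    ys = []
    for x in xs:
        h = a * h + b * x  # state update
        ys.append(c * h)   # readout
    return ys


def pairwise_interactions(length):
    """Number of entries in a full self-attention score map: L * L."""
    return length * length


if __name__ == "__main__":
    # Impulse input: the output decays geometrically with factor a.
    print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0))

    # Doubling L doubles the scan steps but quadruples the attention pairs.
    for n in (64, 128, 256):
        print(n, "scan steps:", n, "attention pairs:", pairwise_interactions(n))
```

Doubling the sequence length doubles the number of scan steps but quadruples the number of attention pairs, which is the gap the linear-complexity backbone is meant to close.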
