Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection

Haoxuan Wang, Qingdong He, Jinlong Peng, Hao Yang, Mingmin Chi, Yabiao Wang

TL;DR

Mamba-YOLO-World is a novel YOLO-based open-vocabulary detection (OVD) model that uses the proposed MambaFusion Path Aggregation Network (MambaFusion-PAN) as its neck architecture. It surpasses existing state-of-the-art OVD methods with fewer parameters and FLOPs.

Abstract

Open-vocabulary detection (OVD) aims to detect objects beyond a predefined set of categories. As a pioneering model incorporating the YOLO series into OVD, YOLO-World is well-suited for scenarios prioritizing speed and efficiency. However, its performance is hindered by its neck feature fusion mechanism, which incurs quadratic complexity and provides only limited guided receptive fields. To address these limitations, we present Mamba-YOLO-World, a novel YOLO-based OVD model employing the proposed MambaFusion Path Aggregation Network (MambaFusion-PAN) as its neck architecture. Specifically, we introduce an innovative State Space Model-based feature fusion mechanism consisting of a Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan algorithm, both with linear complexity and globally guided receptive fields. This mechanism leverages multi-modal input sequences and Mamba hidden states to guide the selective scanning process. Experiments demonstrate that our model outperforms the original YOLO-World on the COCO and LVIS benchmarks in both zero-shot and fine-tuning settings while maintaining comparable parameters and FLOPs. Additionally, it surpasses existing state-of-the-art OVD methods with fewer parameters and FLOPs.
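
The complexity contrast in the abstract can be made concrete with a rough operation-count model. The sketch below is illustrative only and is not the paper's FLOP measurement. It assumes that the quadratic term comes from scoring every image token against every text token, and that a scan-based alternative makes a single pass over each sequence with a fixed-size state. The function names, the stride of 32, and the state size of 16 are hypothetical choices made for this example.

```python
def image_tokens(resolution: int, stride: int = 32) -> int:
    """Feature-map positions for a square image at a given stride (assumed)."""
    side = resolution // stride
    return side * side


def pairwise_fusion_cost(num_image: int, num_text: int, dim: int) -> int:
    """Multiply-adds to score every image token against every text token.

    Assumed cost model: grows with the product of the two sequence lengths.
    """
    return num_image * num_text * dim


def scan_fusion_cost(num_image: int, num_text: int, dim: int, state: int) -> int:
    """Multiply-adds for one recurrent pass over each sequence.

    Assumed cost model: grows with the sum of the two sequence lengths,
    because each token only updates a fixed-size hidden state.
    """
    return (num_image + num_text) * dim * state


if __name__ == "__main__":
    dim, state = 256, 16  # hypothetical channel width and state size
    # Grow the image resolution and the vocabulary size together.
    for resolution, vocabulary in ((320, 40), (640, 160), (1280, 640)):
        n = image_tokens(resolution)
        pairwise = pairwise_fusion_cost(n, vocabulary, dim)
        scan = scan_fusion_cost(n, vocabulary, dim, state)
        print(f"res={resolution} tokens={n} vocab={vocabulary} "
              f"pairwise={pairwise} scan={scan}")
```

Under these assumptions, doubling both sequence lengths quadruples the pairwise count but only doubles the scan count, which is the kind of gap a linear-complexity fusion mechanism is meant to close.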

Paper Structure (17 sections, 1 equation, 3 figures, 4 tables, 1 algorithm)

Figures (3)

  • Figure 1: Visualization Results of Zero-shot Inference on LVIS. Our Mamba-YOLO-World significantly outperforms YOLO-World in terms of accuracy and generalization across small, medium, and large models.
  • Figure 2: Overall Architecture of Mamba-YOLO-World. It consists of five key components: (a) MambaFusion-PAN is our proposed feature fusion network, which replaces the Path Aggregation Feature Pyramid Network in YOLO. (b) TextMambaBlock comprises stacked Mamba layers that scan the input text embeddings to extract the output text features and the text hidden state (THS). (c) MF-CSPLayer incorporates the proposed PGSS algorithm into a YOLO CSPLayer-style network. (d) In the Parallel-Guided Selective Scan (PGSS) algorithm, the compressed textual information THS is injected into the Mamba parameters in parallel with the entire visual selective scanning process to extract the output image features and the image hidden state (IHS). (e) SGSS-TextMambaBlock is a TextMambaBlock with a Serial-Guided Selective Scan (SGSS) algorithm. It adjusts the Mamba parameters serially by scanning the compressed visual information IHS before extracting the text features.
  • Figure 3: Comparison of Neck FLOPs Across Different Image Resolutions
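
The data flow described in the Figure 2 caption (text scan, then text-guided image scan, then image-guided text scan) can be sketched with a toy scalar recurrence. This is a minimal illustration under strong assumptions and is not the paper's PGSS or SGSS implementation: the real model operates on multi-channel features with learned Mamba parameters, and the caption does not specify how the guidance enters them. Here, "parallel" guidance is modeled as a hypothetical `guide` value that shifts the gate at every step, and "serial" guidance is modeled by scanning the guiding state as a prefix token before the sequence itself. All names in the code are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def guided_scan(
    inputs: Sequence[float],
    guide: float = 0.0,
    init_state: float = 0.0,
) -> Tuple[List[float], float]:
    """Toy input-dependent linear recurrence over a scalar sequence.

    The retention gate depends on the current input (the "selective"
    part) and on a constant `guide` (assumed stand-in for a compressed
    hidden state from the other modality). One pass, so the cost is
    linear in len(inputs). Returns per-step outputs and the final state.
    """
    state = init_state
    outputs: List[float] = []
    for x in inputs:
        keep = sigmoid(x + guide)  # gate in (0, 1), shifted by the guide
        state = keep * state + (1.0 - keep) * x
        outputs.append(state)
    return outputs, state


if __name__ == "__main__":
    text_embeddings = [0.2, -0.1, 0.4]    # made-up toy values
    image_tokens = [1.0, 0.5, -0.3, 0.8]  # made-up toy values

    # (b) Scan the text alone to obtain a compressed text state (THS analogue).
    _, text_state = guided_scan(text_embeddings)

    # (d) Parallel guidance: the text state shifts the gate at every image step.
    image_out, image_state = guided_scan(image_tokens, guide=text_state)

    # (e) Serial guidance: scan the image state first, then the text tokens.
    serial_out, _ = guided_scan([image_state] + list(text_embeddings))
    text_out = serial_out[1:]  # drop the output for the prefix token

    print("image features:", image_out)
    print("text features:", text_out)
```

Each modality is scanned once and the other modality reaches it only through a single compressed state, so no image-token-by-text-token score matrix is ever formed in this toy version.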