MambaRefine-YOLO: A Dual-Modality Small Object Detector for UAV Imagery

Shuyu Cao, Minxin Chen, Yucheng Song, Zhaozhong Chen, Xinyou Zhang

TL;DR

MambaRefine-YOLO tackles small-object detection in UAV imagery by fusing RGB and IR data through a Dual-Gated Complementary Mamba Fusion Module (DGC-MFM) and refining multi-scale features with a Hierarchical Feature Aggregation Neck (HFAN). The DGC-MFM uses illumination-aware gating and content-aware difference gating to adaptively fuse the two modalities, while HFAN employs a refine-then-fuse strategy to enhance cross-scale features and adds a dedicated small-object head. The full model reports a state-of-the-art mAP of 83.2% on the dual-modality DroneVehicle dataset, and the HFAN-only variant (HFAN-YOLO) reaches 49.4% mAP on the single-modality VisDrone dataset, validating the components both individually and in combination. Because the Mamba-based fusion models long-range context with linear complexity $O(N)$, cross-modal interaction does not require prohibitive compute, and the resulting trade-off between accuracy and speed suits real-time UAV surveillance and other remote-sensing tasks.

Abstract

Small object detection in Unmanned Aerial Vehicle (UAV) imagery is a persistent challenge, hindered by low resolution and background clutter. While fusing RGB and infrared (IR) data offers a promising solution, existing methods often struggle with the trade-off between effective cross-modal interaction and computational efficiency. In this letter, we introduce MambaRefine-YOLO. Its core contributions are a Dual-Gated Complementary Mamba Fusion Module (DGC-MFM) that adaptively balances RGB and IR modalities through illumination-aware and difference-aware gating mechanisms, and a Hierarchical Feature Aggregation Neck (HFAN) that uses a "refine-then-fuse" strategy to enhance multi-scale features. Our comprehensive experiments validate this dual-pronged approach. On the dual-modality DroneVehicle dataset, the full model achieves a state-of-the-art mAP of 83.2%, an improvement of 7.9% over the baseline. On the single-modality VisDrone dataset, a variant using only the HFAN also shows significant gains, demonstrating its general applicability. Our work presents a superior balance between accuracy and speed, making it highly suitable for real-world UAV applications.
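
To make the gating idea in the abstract more concrete, the following is a minimal sketch of how an illumination-aware gate and a difference-aware gate might combine RGB and IR features. It is not the authors' formulation: the function names, the sigmoid and tanh gate shapes, the `midpoint` and `sharpness` parameters, and the final blending rule are all assumptions chosen for illustration, and features are plain lists of floats rather than feature maps.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def illumination_gate(rgb_feat, midpoint=0.5, sharpness=10.0):
    """Hypothetical scalar gate in (0, 1): close to 1 for bright RGB input,
    close to 0 for dark input. midpoint and sharpness are illustrative."""
    mean_brightness = sum(rgb_feat) / len(rgb_feat)
    return sigmoid(sharpness * (mean_brightness - midpoint))


def difference_gate(rgb_feat, ir_feat):
    """Hypothetical per-element gate in [0, 1): grows where the two
    modalities disagree, and is 0 where they are identical."""
    return [math.tanh(abs(r - i)) for r, i in zip(rgb_feat, ir_feat)]


def dual_gated_fusion(rgb_feat, ir_feat):
    """Assumed fusion rule: blend the modalities by illumination, then pull
    each element toward the stronger response where they disagree."""
    g_ill = illumination_gate(rgb_feat)
    g_diff = difference_gate(rgb_feat, ir_feat)
    fused = []
    for r, i, d in zip(rgb_feat, ir_feat, g_diff):
        base = g_ill * r + (1.0 - g_ill) * i   # illumination-weighted blend
        fused.append(base + d * (max(r, i) - base))
    return fused


# Dark scene: RGB is nearly empty, IR sees a warm object in the first position.
print(dual_gated_fusion([0.05, 0.05], [0.9, 0.1]))
```

In this sketch a dark RGB input drives the illumination gate toward 0, so the fused output follows the IR response, while identical inputs pass through unchanged because the difference gate is 0.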

Paper Structure

This paper contains 10 sections, 9 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The overall architecture of MambaRefine-YOLO. It consists of a dual-stream backbone where Dual-Gated Complementary Mamba Fusion Modules (DGC-MFM) are applied at four different scales ($C_2$ to $C_5$). The fused features are then processed by the Hierarchical Feature Aggregation Neck (HFAN), which contains several Adaptive Scale Fusion Blocks (ASFB). Finally, a multi-scale detection head produces the output.
  • Figure 2: The Dual-Gated Complementary Mamba Fusion Module (DGC-MFM) consists of four main stages: (1) Illumination Gate (IG) and Difference Gate (DG) generate adaptive weights, (2) Dual-gated fusion combines RGB and IR features, (3) Bidirectional Mamba processes the fused features to capture global context, and (4) Feature Refinement and Integration utilizes residual connections and a Fusion-Shuffle mechanism to generate the final feature pyramid.
  • Figure 3: Qualitative results of MambaRefine-YOLO (ours) vs. SOTA methods on DroneVehicle (RGB/IR fusion). Red and yellow circles highlight misdetections.
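
The Figure 2 caption mentions a bidirectional Mamba stage that captures global context. As a rough illustration of why a scan-based model costs time linear in the sequence length, the sketch below runs a fixed linear recurrence once forward and once backward over a flattened feature sequence and sums the two passes. This is a deliberate simplification and not the paper's module: real Mamba layers use input-dependent (selective) parameters and a vector-valued state, and the `decay` and `gain` constants and both function names here are hypothetical.

```python
def linear_scan(xs, decay=0.5, gain=1.0):
    """One pass of h_t = decay * h_{t-1} + gain * x_t.
    Each element is visited once, so the cost is O(N) in len(xs)."""
    h = 0.0
    out = []
    for x in xs:
        h = decay * h + gain * x
        out.append(h)
    return out


def bidirectional_scan(xs, decay=0.5, gain=1.0):
    """Sum of a forward scan and a backward scan, so every position
    receives context from both sides of the sequence."""
    fwd = linear_scan(xs, decay, gain)
    bwd = linear_scan(xs[::-1], decay, gain)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


# An impulse at the first position decays as it propagates forward.
print(bidirectional_scan([1.0, 0.0, 0.0]))  # [2.0, 0.5, 0.25]
```

Two passes over N elements still cost O(N), in contrast to pairwise attention, which compares every position with every other and scales as O(N^2).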