MambaRefine-YOLO: A Dual-Modality Small Object Detector for UAV Imagery
Shuyu Cao, Minxin Chen, Yucheng Song, Zhaozhong Chen, Xinyou Zhang
TL;DR
MambaRefine-YOLO tackles small-object detection in UAV imagery by fusing RGB and IR data through a Dual-Gated Complementary Mamba Fusion Module (DGC-MFM) and refining multi-scale features with a Hierarchical Feature Aggregation Neck (HFAN). The DGC-MFM combines illumination-aware gating with content-aware difference gating to fuse the two modalities adaptively. Because it builds on Mamba, it models long-range context and cross-modal interaction at linear complexity $O(N)$, without prohibitive compute. HFAN employs a refine-then-fuse strategy to enhance cross-scale features and adds a dedicated small-object head. The full model reaches a new state-of-the-art mAP of 83.2% on the dual-modality DroneVehicle dataset, and the HFAN-only variant (HFAN-YOLO) reaches 49.4% mAP on the single-modality VisDrone dataset, showing that the neck generalizes and validating the two components both individually and in combination. The resulting trade-off between accuracy and speed makes the approach suitable for real-time UAV surveillance and other remote-sensing tasks.
Abstract
Small object detection in Unmanned Aerial Vehicle (UAV) imagery is a persistent challenge, hindered by low resolution and background clutter. While fusing RGB and infrared (IR) data offers a promising solution, existing methods often struggle with the trade-off between effective cross-modal interaction and computational efficiency. In this letter, we introduce MambaRefine-YOLO. Its core contributions are a Dual-Gated Complementary Mamba Fusion Module (DGC-MFM), which adaptively balances the RGB and IR modalities through illumination-aware and difference-aware gating mechanisms, and a Hierarchical Feature Aggregation Neck (HFAN), which uses a "refine-then-fuse" strategy to enhance multi-scale features. Comprehensive experiments validate this dual-pronged approach. On the dual-modality DroneVehicle dataset, the full model achieves a state-of-the-art mAP of 83.2%, an improvement of 7.9% over the baseline. On the single-modality VisDrone dataset, a variant using only the HFAN also shows significant gains, demonstrating that the neck applies beyond the dual-modality setting. The model offers a favorable balance between accuracy and speed, making it well suited to real-world UAV applications.
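To make the dual-gating idea concrete, the following is a minimal NumPy sketch of how an illumination-aware gate and a difference-aware gate could be combined. It is not the authors' DGC-MFM: the function name `dual_gated_fusion`, the brightness-based global weight, the sigmoid slope and threshold, and the residual exchange rule are all assumptions chosen for illustration, and the Mamba (state-space) sequence modeling is omitted entirely.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def dual_gated_fusion(rgb_feat, ir_feat, rgb_image, slope=8.0, threshold=0.5):
    """Illustrative dual-gated RGB/IR fusion (hypothetical, not the paper's module).

    rgb_feat, ir_feat: feature maps of identical shape, e.g. (C, H, W).
    rgb_image: RGB image with values in [0, 1], used only to estimate brightness.
    slope, threshold: hypothetical constants for the illumination gate.
    """
    # Illumination-aware gate (assumption): one global weight from mean brightness.
    # Dark scenes push the weight toward the IR branch.
    w_rgb = sigmoid(slope * (rgb_image.mean() - threshold))
    w_ir = 1.0 - w_rgb

    # Difference-aware gates (assumption): each branch receives more of the other
    # modality at positions where the other modality responds more strongly.
    diff = rgb_feat - ir_feat
    rgb_enh = rgb_feat + sigmoid(-diff) * ir_feat
    ir_enh = ir_feat + sigmoid(diff) * rgb_feat

    # Blend the two enhanced branches with the illumination weights.
    fused = w_rgb * rgb_enh + w_ir * ir_enh
    return fused, w_rgb


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb_feat = rng.standard_normal((4, 8, 8))
    ir_feat = rng.standard_normal((4, 8, 8))
    night_image = np.full((3, 32, 32), 0.05)  # synthetic dark frame
    fused, w_rgb = dual_gated_fusion(rgb_feat, ir_feat, night_image)
    print(fused.shape, float(w_rgb))
```

In this sketch a dark frame yields a small RGB weight, so the fused map is dominated by the IR-led branch, which mirrors the intuition behind illumination-aware gating. A learned module would replace the fixed constants with trainable parameters.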
