DGE-YOLO: Dual-Branch Gathering and Attention for Accurate UAV Object Detection

Kunwei Lv, Zhiren Xiao, Hang Ren, Ping Lan

TL;DR

DGE-YOLO tackles UAV object detection under challenging conditions by fusing infrared and visible data through a dual-branch backbone, an Efficient Multi-Scale Attention module, and a Gather-and-Distribute neck within an end-to-end YOLO-based detector. The approach enables robust cross-scale and cross-modality feature learning while preserving efficiency, leading to improved small-object detection. Empirical results on the Drone Vehicle dataset show clear gains over both single-modal and multimodal baselines, with ablations validating the contributions of each module. This framework offers a practical path toward robust, real-time multimodal UAV perception compatible with existing YOLO variants.

Abstract

The rapid proliferation of unmanned aerial vehicles (UAVs) has highlighted the importance of robust and efficient object detection in diverse aerial scenarios. Detecting small objects under complex conditions, however, remains a significant challenge. To address this, we present DGE-YOLO, an enhanced YOLO-based detection framework designed to effectively fuse multi-modal information. We introduce a dual-branch architecture for modality-specific feature extraction, enabling the model to process both infrared and visible images. To further enrich semantic representation, we propose an Efficient Multi-scale Attention (EMA) mechanism that enhances feature learning across spatial scales. Additionally, we replace the conventional neck with a Gather-and-Distribute (GD) module to mitigate information loss during feature aggregation. Extensive experiments on the Drone Vehicle dataset demonstrate that DGE-YOLO achieves superior performance over state-of-the-art methods, validating its effectiveness in multi-modal UAV object detection tasks.

Paper Structure

This paper contains 10 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Schematic diagrams of three types of detection frames.
  • Figure 2: The architecture of the proposed DGE-YOLO. $P_a$ and $P_b$ denote the features extracted by the backbone from the infrared and visible images, respectively, and $P_c$ denotes the fused feature. $P_c$ is divided into $B2$, $B3$, $B4$, and $B5$ for subsequent processing. In the Low-Gather-and-Distribute branch, the avgpool, bilinear, and concat operations form Low-FAM, and the remaining operations form Low-IFM. Similarly, in the High-Gather-and-Distribute branch, avgpool and concat form High-FAM, and the remaining operations form High-IFM.
  • Figure 3: EMA Module Network Architecture Diagram.
  • Figure 4: Visualization results comparing our method with the baseline on each of the two single modalities. Solid red boxes mark the car class, pink the truck class, purple the freight car class, and orange the bus class. Red dashed boxes mark misdetected targets, yellow dashed boxes mark missed targets, and green dashed boxes mark correctly detected targets.
  • Figure 5: Comparison of different modules across modalities.
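
The Figure 2 caption says that Low-FAM is built from avgpool, bilinear, and concat operations applied to $B2$ through $B5$. The following is a minimal pure-Python sketch of that kind of alignment step, not the authors' implementation. Several details are assumptions that the caption does not state: that all maps are aligned to the spatial size of $B4$, that the larger maps ($B2$, $B3$) are average-pooled down while the smaller map ($B5$) is bilinearly upsampled, and that each pyramid level halves the resolution. The names `avg_pool`, `bilinear_resize`, and `low_fam` are hypothetical.

```python
from typing import List

# A feature map is a nested list indexed as [channel][row][column].
FeatureMap = List[List[List[float]]]


def avg_pool(fmap: FeatureMap, k: int) -> FeatureMap:
    """Non-overlapping k x k average pooling (stride k) on every channel."""
    out = []
    for ch in fmap:
        h, w = len(ch) // k, len(ch[0]) // k
        out.append([
            [
                sum(ch[i * k + di][j * k + dj]
                    for di in range(k) for dj in range(k)) / (k * k)
                for j in range(w)
            ]
            for i in range(h)
        ])
    return out


def bilinear_resize(fmap: FeatureMap, out_h: int, out_w: int) -> FeatureMap:
    """Bilinear interpolation using pixel-centre coordinates, clamped at the borders."""
    out = []
    for ch in fmap:
        in_h, in_w = len(ch), len(ch[0])
        rows = []
        for i in range(out_h):
            # Map the output pixel centre back into input coordinates.
            y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
            y0 = int(y)
            y1 = min(y0 + 1, in_h - 1)
            wy = y - y0
            row = []
            for j in range(out_w):
                x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
                x0 = int(x)
                x1 = min(x0 + 1, in_w - 1)
                wx = x - x0
                top = ch[y0][x0] * (1 - wx) + ch[y0][x1] * wx
                bottom = ch[y1][x0] * (1 - wx) + ch[y1][x1] * wx
                row.append(top * (1 - wy) + bottom * wy)
            rows.append(row)
        out.append(rows)
    return out


def low_fam(b2: FeatureMap, b3: FeatureMap,
            b4: FeatureMap, b5: FeatureMap) -> FeatureMap:
    """Align B2..B5 to B4's spatial size (an assumption) and concatenate channels."""
    target_h, target_w = len(b4[0]), len(b4[0][0])
    aligned = [
        avg_pool(b2, len(b2[0]) // target_h),       # larger map: pool down
        avg_pool(b3, len(b3[0]) // target_h),       # larger map: pool down
        b4,                                         # already at target size
        bilinear_resize(b5, target_h, target_w),    # smaller map: upsample
    ]
    # Channel-wise concatenation: append the channel lists in order.
    return [ch for fmap in aligned for ch in fmap]
```

Calling `low_fam` on maps with 1, 1, 2, and 1 channels returns a single map with 5 channels at the resolution of `b4`. In a real detector the same three operations would run on framework tensors, and the concatenated map would then pass to the fusion stage that the caption calls Low-IFM.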