Efficient Feature Fusion for UAV Object Detection

Xudong Wang, Yaxin Peng, Chaomin Shen

TL;DR

This work tackles the imbalance between classification and localization in UAV object detection by introducing a Fusion Multi-Head Self-Attention (FMSA) framework that uses Fusion Down Sample (FDS) and Fusion Up Sample (FUS) modules to enable cross-layer, multi-scale feature fusion within CNNs. Integrated into YOLO-v10, the framework preserves the parameter count while improving small-object localization and overall classification through MHSA-based fusion and local/global attention mechanisms. Experimental results on VisDrone2019 and DOTA-v1.5 show consistent improvements in $mAP_{50}$ (e.g., +2.1 percentage points on VisDrone2019 and +2.0 on DOTA-v1.5), and an ablation study shows that the full module combination yields up to +4.4 percentage points over the baseline. The approach offers a practical, plug-and-play enhancement for UAV detection tasks, with potential applicability to other CNN-based detectors and future extension to Transformer-based architectures.

Abstract

Object detection in unmanned aerial vehicle (UAV) remote sensing images poses significant challenges due to unstable image quality, small object sizes, complex backgrounds, and environmental occlusions. Small objects in particular occupy only a small portion of an image, which makes accurate detection difficult. Existing multi-scale feature fusion methods address these challenges to some extent by aggregating features across different resolutions. However, they often fail to balance classification and localization performance for small objects, primarily because of insufficient feature representation and imbalanced network information flow. In this paper, we propose a novel feature fusion framework specifically designed for UAV object detection tasks to enhance both localization accuracy and classification performance. The proposed framework integrates hybrid upsampling and downsampling modules, enabling feature maps from different network depths to be flexibly adjusted to arbitrary resolutions. This design facilitates cross-layer connections and multi-scale feature fusion, improving the representation of small objects. Our approach leverages hybrid downsampling to enhance fine-grained feature representation, improving spatial localization of small targets even under complex conditions. Simultaneously, the upsampling module aggregates global contextual information, optimizing feature consistency across scales and enhancing classification robustness in cluttered scenes. Experimental results on two public UAV datasets demonstrate the effectiveness of the proposed framework. Integrated into the YOLO-v10 model, our method achieves a 2% improvement in average precision (AP) compared to the baseline YOLO-v10 model, while maintaining the same number of parameters. These results highlight the potential of our framework for accurate and efficient UAV object detection.
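
The idea of bringing feature maps from different depths to one resolution before fusing them can be illustrated with a toy example. The sketch below is not the paper's FDS or FUS module: it assumes single-channel maps stored as nested lists, uses plain average pooling and nearest-neighbour upsampling as stand-ins for the hybrid sampling modules, and fuses by an element-wise mean instead of attention. All function names (`downsample_avg`, `upsample_nearest`, `fuse_mean`) are hypothetical.

```python
from typing import List

# Single-channel H x W feature map; a real detector would use C x H x W tensors.
FeatureMap = List[List[float]]


def downsample_avg(fmap: FeatureMap, factor: int) -> FeatureMap:
    """Average-pool by an integer factor (H and W must be divisible by it)."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(0, h, factor):
        row = []
        for j in range(0, w, factor):
            block = [fmap[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def upsample_nearest(fmap: FeatureMap, factor: int) -> FeatureMap:
    """Nearest-neighbour upsampling by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def fuse_mean(maps: List[FeatureMap]) -> FeatureMap:
    """Element-wise mean of same-sized maps (assumed stand-in for learned fusion)."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) / len(maps) for j in range(w)]
            for i in range(h)]


# Three maps at different resolutions, all aligned to the 2 x 2 middle scale.
shallow = [[float(i * 4 + j) for j in range(4)] for i in range(4)]  # 4 x 4
middle = [[1.0, 2.0], [3.0, 4.0]]                                   # 2 x 2
deep = [[10.0]]                                                     # 1 x 1

fused = fuse_mean([downsample_avg(shallow, 2), middle, upsample_nearest(deep, 2)])
```

Here the shallow map is reduced and the deep map is enlarged so that all three share the middle map's resolution, which is the precondition for any cross-layer fusion step, learned or not.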

Paper Structure

This paper contains 20 sections, 11 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Illustration of our framework. (a) Backbone with the fusion down sample (FDS) module added to collect shallow-layer features. (b) Neck of the network. (c) Fusion up sample (FUS) module for collecting deeper-layer features. (d) Detection head. (e) Fusion multi-head self-attention (FMSA) module, which fuses the outputs of (a), (b), and (c).
  • Figure 2: Illustration of our model architecture. The proposed method, comprising the FDS, FUS, and FMSA modules, is a supplementary component integrated into the state-of-the-art model YOLO-v10.
  • Figure 3: Illustration of the FMSA module, which is added to a CNN-based network. Alongside the main output of the network, the FDS module collects shallow-layer features and the FUS module collects deeper-layer features; the FMSA module then fuses them.
  • Figure 4: Illustration of the FDS module. The network structure of the FDS module is shown on the right, and the downsampling module of YOLO-v10 is shown on the left for comparison.
  • Figure 5: Illustration of the FUS module, which performs an up-sampling operation on the target deeper layers.
  • ...and 2 more figures
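
The captions above describe FMSA as fusing three inputs (the main output, the shallow features from FDS, and the deeper features from FUS) with self-attention. As a rough illustration of that fusion step only, the sketch below treats the three feature vectors at one spatial location as tokens and applies scaled dot-product self-attention across them. It is a simplification under stated assumptions, not the paper's module: it uses a single head, identity query/key/value projections instead of learned ones, and a mean over tokens to produce one fused vector. The names `softmax` and `self_attention_fuse` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_fuse(tokens: List[Vector]) -> Vector:
    """Single-head scaled dot-product self-attention over tokens, then token mean.

    Assumption: Q, K and V are the tokens themselves (identity projections);
    a trained module would apply learned linear maps and several heads.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    attended = []
    for q in tokens:
        # Similarity of this token to every token, including itself.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Each output is a convex combination of the input tokens.
        attended.append([sum(w * v[c] for w, v in zip(weights, tokens))
                         for c in range(d)])
    # Collapse the attended tokens into one fused feature vector.
    return [sum(t[c] for t in attended) / len(attended) for c in range(d)]


# One spatial location, two channels: shallow (FDS), main, and deep (FUS) features.
tokens_at_pixel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_vector = self_attention_fuse(tokens_at_pixel)
```

Because every attention output is a weighted average of the inputs, the fused vector stays within the range spanned by the three sources; a multi-head variant would split the channels into groups and run the same computation per group.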