CollabOD: Collaborative Multi-Backbone with Cross-scale Vision for UAV Small Object Detection

Xuecheng Bai, Yuxiang Wang, Chuanzhi Xu, Boyu Hu, Kang Han, Ruijie Pan, Xiaowei Niu, Xiaotian Guan, Liqiang Fu, Pengfei Ye

TL;DR

CollabOD is a lightweight collaborative detection framework for UAV small object detection. It explicitly preserves structural details and aligns heterogeneous feature streams before multi-scale fusion, optimizing the architecture of conventional UAV perception models.

Abstract

Small object detection in unmanned aerial vehicle (UAV) imagery is challenging, mainly due to scale variation, structural detail degradation, and limited computational resources. In high-altitude scenarios, fine-grained features are further weakened during hierarchical downsampling and cross-scale fusion, resulting in unstable localization and reduced robustness. To address this issue, we propose CollabOD, a lightweight collaborative detection framework that explicitly preserves structural details and aligns heterogeneous feature streams before multi-scale fusion. The framework integrates Structural Detail Preservation, Cross-Path Feature Alignment, and Localization-Aware Lightweight Design strategies. From the perspectives of image processing, channel structure, and lightweight design, it optimizes the architecture of conventional UAV perception models. The proposed design enhances representation stability while maintaining efficient inference. A unified detail-aware detection head further improves regression robustness without introducing additional deployment overhead. The code is available at: https://github.com/Bai-Xuecheng/CollabOD.

Paper Structure

This paper contains 26 sections, 9 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of conventional single-stream detection and CollabOD. Single-stream methods attenuate structural cues and fuse features implicitly, which leads to spatial misalignment. CollabOD decouples and aligns structural and detail representations before fusion to improve stability and accuracy.
  • Figure 2: Overview of the proposed CollabOD framework. DPF-Stem denotes the Dual-Path Fusion Stem, DABlock the Dense Aggregation Block, and BRM the Bilateral Reweighting Module. The UDA Head is the Unified Detail-Aware Head, detailed in the section "Localization-Aware Lightweight Design". The remaining components are inherited from the original YOLO11 architecture.
  • Figure 3: Effective Receptive Field Visualization Across DABlock Layers.
  • Figure 4: Qualitative comparison between the baseline and CollabOD on VisDrone-2019-DET. In complex aerial scenes with predominantly small objects, CollabOD exhibits a lower miss rate and more accurate localization than the baseline model.
  • Figure 5: Visualization of the detection results of ClusDet and the proposed method. Two representative cases are selected, and a focused comparison is conducted on the highlighted regions.
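
The idea in Figure 1 of weighting two feature streams before fusing them can be illustrated with a small sketch. This is not the paper's Bilateral Reweighting Module, whose internals are not described in this section. It is a minimal pure-Python illustration under stated assumptions: the function names `channel_means` and `reweighted_fusion` are hypothetical, each stream is a list of channels stored as flat lists of floats, and the per-channel weights come from a two-way softmax over global channel means rather than from learned parameters.

```python
import math


def channel_means(stream):
    """Return the global average of each channel in a stream.

    A stream is a list of channels; each channel is a flat list of floats.
    """
    return [sum(channel) / len(channel) for channel in stream]


def reweighted_fusion(structure, detail):
    """Fuse two same-shaped feature streams with per-channel softmax weights.

    Hypothetical illustration only: a real module would learn its weights,
    whereas this sketch derives them from each channel's global mean.
    """
    if len(structure) != len(detail):
        raise ValueError("streams must have the same number of channels")

    fused = []
    means = zip(channel_means(structure), channel_means(detail))
    for s_channel, d_channel, (s_mean, d_mean) in zip(structure, detail, means):
        if len(s_channel) != len(d_channel):
            raise ValueError("channels must have the same number of elements")
        # Two-way softmax over the channel descriptors, shifted by the
        # maximum for numerical stability.
        shift = max(s_mean, d_mean)
        e_s = math.exp(s_mean - shift)
        e_d = math.exp(d_mean - shift)
        w_s = e_s / (e_s + e_d)
        w_d = 1.0 - w_s
        # Each fused value is a convex combination of the two inputs.
        fused.append([w_s * s + w_d * d for s, d in zip(s_channel, d_channel)])
    return fused


if __name__ == "__main__":
    structure_stream = [[1.0, 1.0], [0.0, 2.0]]
    detail_stream = [[1.0, 1.0], [4.0, 4.0]]
    print(reweighted_fusion(structure_stream, detail_stream))
```

Because the two weights for a channel sum to one, every fused value lies between the corresponding values of the two input streams. The stream with the stronger average response in that channel dominates the result.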