High-Frequency Semantics and Geometric Priors for End-to-End Detection Transformers in Challenging UAV Imagery

Hongxing Peng, Lide Chen, Hui Zhu, Yan Chen

TL;DR

This work tackles real-time object detection in UAV imagery, where small, densely packed, and occluded objects against cluttered backgrounds degrade detector performance. It introduces HEDS-DETR, a holistically enhanced Detection Transformer built around four components: HFESNet, a backbone that preserves high-frequency spatial details; ESOP, which efficiently fuses high-resolution features for small objects; and SQR and GAPE, which stabilize decoding and supply geometry-aware priors. On VisDrone, HEDS-DETR improves over its baseline by +3.8% AP and +5.1% AP50 while using 4M fewer parameters and preserving real-time speed, and it generalizes well to CARPK. The approach offers a favorable accuracy–efficiency trade-off for dense aerial detection and points toward extending frequency-aware and geometry-informed decoding to other dense prediction tasks in aerial vision.

Abstract

Object detection in Unmanned Aerial Vehicle (UAV) imagery is fundamentally challenged by a prevalence of small, densely packed, and occluded objects within cluttered backgrounds. Conventional detectors struggle with this domain, as they rely on hand-crafted components like pre-defined anchors and heuristic-based Non-Maximum Suppression (NMS), creating a well-known performance bottleneck in dense scenes. Even recent end-to-end frameworks have not been purpose-built to overcome these specific aerial challenges, resulting in a persistent performance gap. To bridge this gap, we introduce HEDS-DETR, a holistically enhanced real-time Detection Transformer tailored for aerial scenes. Our framework features three key innovations. First, we propose a novel High-Frequency Enhanced Semantics Network (HFESNet) backbone, which yields highly discriminative features by preserving critical high-frequency details alongside robust semantic context. Second, our Efficient Small Object Pyramid (ESOP) counteracts information loss by efficiently fusing high-resolution features, significantly boosting small object detection. Finally, we enhance decoder stability and localization precision with two synergistic components: Selective Query Recollection (SQR) and Geometry-Aware Positional Encoding (GAPE), which stabilize optimization and provide explicit spatial priors for dense object arrangements. On the VisDrone dataset, HEDS-DETR achieves a +3.8% AP and +5.1% AP50 gain over its baseline while reducing parameters by 4M and maintaining real-time speeds. This demonstrates a highly competitive accuracy-efficiency balance, especially for detecting dense and small objects in aerial scenes.
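
To make the NMS bottleneck mentioned in the abstract concrete, the following is a minimal sketch of standard greedy NMS applied to a dense scene. It is not part of HEDS-DETR, which is end-to-end and avoids NMS. The box coordinates, scores, and the 0.5 IoU threshold are illustrative assumptions, and the helper names `iou` and `greedy_nms` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep boxes in descending score order, dropping any box whose IoU
    with an already kept box exceeds the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical scene: two distinct but tightly packed vehicles (indices 0, 1)
# and one isolated vehicle (index 2). All three are true objects.
boxes = [(0, 0, 10, 20), (3, 0, 13, 20), (40, 0, 50, 20)]
scores = [0.9, 0.8, 0.7]

kept = greedy_nms(boxes, scores)                 # default threshold of 0.5
kept_loose = greedy_nms(boxes, scores, 0.6)      # a looser threshold
print(kept, kept_loose)
```

With the default threshold, the second vehicle is suppressed because its box overlaps the first one by more than 0.5 IoU, even though it is a separate object. Raising the threshold recovers it in this toy case, but in practice a looser threshold also lets through more duplicate detections, which is why a fixed heuristic is hard to tune for dense scenes.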

Paper Structure

This paper contains 20 sections, 13 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Challenges in UAV Object Detection: dense small objects and cluttered backgrounds
  • Figure 2: Overview of HEDS-DETR.
  • Figure 3: The architecture of our proposed CSP-FCA module, which integrates the efficiency of the CSP strategy with the detail-recovery capabilities of the FCA block.
  • Figure 4: The architecture of the proposed Cross-scale Omni-Kernel Block (COKBlock). FFT, IFFT, and GAP denote the Fast Fourier Transform, Inverse Fast Fourier Transform, and Global Average Pooling, respectively.
  • Figure 5: Illustration of the GAPE strategy at the decoder. The small green rectangles in the diagram denote the positional encodings we designed.
  • ...and 4 more figures