High-Frequency Semantics and Geometric Priors for End-to-End Detection Transformers in Challenging UAV Imagery

Hongxing Peng; Lide Chen; Hui Zhu; Yan Chen

TL;DR

This work tackles real-time object detection in challenging UAV imagery, where small, densely packed, and occluded objects in cluttered backgrounds hinder performance. It introduces HEDS-DETR, a holistically enhanced Detection Transformer built around four components: HFESNet to preserve high-frequency spatial details, ESOP to efficiently fuse high-resolution features for small objects, and SQR plus GAPE to stabilize decoding with geometry-aware priors. On VisDrone, HEDS-DETR gains 3.8% AP and 5.1% AP50 over its baseline while cutting 4M parameters and preserving real-time speed, and it generalizes well to CARPK. The approach delivers a competitive accuracy–efficiency trade-off for dense aerial detection and offers guidance for extending frequency-aware and geometry-informed decoding to other dense prediction tasks in aerial vision.

Abstract

Object detection in Unmanned Aerial Vehicle (UAV) imagery is fundamentally challenged by the prevalence of small, densely packed, and occluded objects within cluttered backgrounds. Conventional detectors struggle in this domain because they rely on hand-crafted components such as pre-defined anchors and heuristic Non-Maximum Suppression (NMS), which create a well-known performance bottleneck in dense scenes. Even recent end-to-end frameworks have not been purpose-built to overcome these specific aerial challenges, resulting in a persistent performance gap. To bridge this gap, we introduce HEDS-DETR, a holistically enhanced real-time Detection Transformer tailored for aerial scenes. Our framework features three key innovations. First, we propose a novel High-Frequency Enhanced Semantics Network (HFESNet) backbone, which yields highly discriminative features by preserving critical high-frequency details alongside robust semantic context. Second, our Efficient Small Object Pyramid (ESOP) counteracts information loss by efficiently fusing high-resolution features, significantly boosting small object detection. Finally, we enhance decoder stability and localization precision with two synergistic components, Selective Query Recollection (SQR) and Geometry-Aware Positional Encoding (GAPE), which stabilize optimization and provide explicit spatial priors for dense object arrangements. On the VisDrone dataset, HEDS-DETR improves over its baseline by 3.8% AP and 5.1% AP50 while using 4M fewer parameters and maintaining real-time speed. This demonstrates a highly competitive accuracy-efficiency balance, especially for detecting dense and small objects in aerial scenes.
