Better Sampling, towards Better End-to-end Small Object Detection
Zile Huang, Chong Zhang, Mingyu Jin, Fangyu Wu, Chengzhi Liu, Xiaobo Jin
TL;DR
This work targets small object detection within end-to-end transformer-based detectors. It introduces three sampling-centric techniques—Sample Points Refinement (SPR), Scale-aligned Target (ST), and task-decoupled Sample Reweighting (SR)—to improve localization, classification, and training emphasis. Empirical results on VisDrone and SODA-D show consistent AP gains over state-of-the-art end-to-end detectors, validating the effectiveness of refined sampling and scale-aware confidence estimation for tiny objects. The proposed approach preserves inference speed while delivering notable improvements in challenging dense scenes with overlapping small targets.
Abstract
While deep learning-based general object detection has made significant strides in recent years, the effectiveness and efficiency of small object detection remain unsatisfactory. This is attributed not only to the limited features of small targets but also to their high density and mutual overlap. Existing transformer-based small object detectors do not adequately balance the trade-off between accuracy and inference speed. To address these challenges, we propose methods that enhance sampling within an end-to-end framework. Sample Points Refinement (SPR) constrains localization and attention, preserving meaningful interactions in the region of interest and filtering out misleading information. Scale-aligned Target (ST) integrates scale information into target confidence, improving classification for small object detection. A task-decoupled Sample Reweighting (SR) mechanism guides attention toward challenging positive examples, using a weight generator module to assess difficulty and adjust the classification loss based on decoder layer outcomes. Comprehensive experiments across various benchmarks show that our proposed detector excels at detecting small objects. Our model achieves a 2.9% increase in average precision (AP) over the state-of-the-art (SOTA) on the VisDrone dataset and a 1.7% improvement on the SODA-D dataset.
