SO-DETR: Leveraging Dual-Domain Features and Knowledge Distillation for Small Object Detection

Huaxiang Zhang, Hao Zhang, Aoran Mei, Zhongxue Gan, Guo-Niu Zhu

TL;DR

This work addresses small object detection in DETR-like architectures by introducing SO-DETR, which fuses spatial and frequency-domain features through a dual-domain hybrid encoder, optimizes query allocation with an Expanded-IoU-based mechanism, and uses knowledge distillation to maintain efficiency with a lightweight backbone. The method reports accuracy gains on the UAV-focused benchmarks VisDrone-2019-DET and UAVVaste at reduced computational cost, and ablations confirm that the components are complementary. By distilling from the final decoder outputs under a linear decay schedule, SO-DETR achieves effective knowledge transfer and improved small-object localization. Overall, the approach offers a practical, efficient path to real-time small-object detection in aerial imagery; balancing high-resolution feature extraction with semantic understanding of large objects is identified as a future priority.
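As a rough illustration of the linear decay schedule mentioned above, the following sketch interpolates a distillation weight over training epochs and applies it to a combined loss. The summary does not say which quantity decays or between which values, so the function names, the choice to decay the loss weight, and the endpoints `w_start` and `w_end` are all assumptions made for illustration.

```python
def distill_weight(epoch: int, total_epochs: int,
                   w_start: float = 1.0, w_end: float = 0.0) -> float:
    """Linearly interpolate a distillation weight across training.

    Assumption: the weight on the distillation loss is what decays;
    w_start and w_end are hypothetical endpoints, not values from the paper.
    """
    if total_epochs <= 1:
        return w_end
    # Clamp the epoch into [0, total_epochs - 1] and map it to t in [0, 1].
    t = min(max(epoch, 0), total_epochs - 1) / (total_epochs - 1)
    return w_start + (w_end - w_start) * t


def total_loss(det_loss: float, kd_loss: float,
               epoch: int, total_epochs: int) -> float:
    """Detection loss plus the decayed distillation term (hypothetical combination)."""
    return det_loss + distill_weight(epoch, total_epochs) * kd_loss
```

Under these assumptions the teacher signal dominates early training and fades out as the student converges on the ground-truth objective.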

Abstract

Detection Transformer-based methods have achieved significant advancements in general object detection. However, challenges remain in effectively detecting small objects. One key difficulty is that existing encoders struggle to efficiently fuse low-level features. Additionally, the query selection strategies are not effectively tailored for small objects. To address these challenges, this paper proposes an efficient model, Small Object Detection Transformer (SO-DETR). The model comprises three key components: a dual-domain hybrid encoder, an enhanced query selection mechanism, and a knowledge distillation strategy. The dual-domain hybrid encoder integrates spatial and frequency domains to fuse multi-scale features effectively. This approach enhances the representation of high-resolution features while maintaining relatively low computational overhead. The enhanced query selection mechanism optimizes query initialization by dynamically selecting high-scoring anchor boxes using expanded IoU, thereby improving the allocation of query resources. Furthermore, by incorporating a lightweight backbone network and implementing a knowledge distillation strategy, we develop an efficient detector for small objects. Experimental results on the VisDrone-2019-DET and UAVVaste datasets demonstrate that SO-DETR outperforms existing methods with similar computational demands. The project page is available at https://github.com/ValiantDiligent/SO_DETR.
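The abstract does not define expanded IoU. One plausible reading, sketched below purely as an assumption, is to enlarge both boxes about their centers by a ratio before computing the usual IoU, so that small boxes that narrowly miss each other still receive a non-zero overlap score. The helper names and the default `ratio` are hypothetical and are not taken from the paper or its code.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def expand(box: Box, ratio: float) -> Box:
    """Scale a box about its center by `ratio` (assumed form of the expansion)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) * ratio / 2, (y2 - y1) * ratio / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def iou(a: Box, b: Box) -> float:
    """Standard intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def expanded_iou(a: Box, b: Box, ratio: float = 1.5) -> float:
    """IoU of the two boxes after both are expanded (hypothetical definition)."""
    return iou(expand(a, ratio), expand(b, ratio))
```

With this definition, two 4×4 boxes separated by a one-pixel gap have a plain IoU of zero but a positive expanded IoU, which is the kind of tolerance that would matter when ranking candidates for small objects.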

Paper Structure

This paper contains 15 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of detection results between RT-DETR-R18 and our proposed SO-DETR-EV2. Within the yellow circle, our model outputs fewer overlapping bounding boxes.
  • Figure 2: Evolution of encoder architectures in DETR-based models and their encoder mechanisms. All models utilize feature layers extracted by a backbone network as input.
  • Figure 3: Overview of the SO-DETR architecture with knowledge distillation. Multi-scale features from the backbone’s four stages are fed into the dual-domain hybrid encoder, which transforms them into a sequence of image features. The enhanced query selection module then selects a fixed number of these features as initial object queries for the decoder. The decoder, with auxiliary prediction heads, iteratively refines the queries to predict object categories and bounding boxes. During knowledge distillation, the output of the teacher model’s decoder is used to compute the distillation loss with respect to the student model’s predictions.
  • Figure 4: The Dual-Domain Fusion block in the encoder. FFT and IFFT denote the Fast Fourier Transform and Inverse Fast Fourier Transform, respectively.
  • Figure 5: Qualitative comparison of detection results and attention heatmaps between RT-DETR-EV2 and our proposed SO-DETR-EV2. Images are from the VisDrone-2019-DET dataset. The yellow boxes highlight areas where our model outperforms RT-DETR-EV2 by generating more precise attention distributions and detecting small and distant objects more effectively.
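
The FFT/IFFT path named in the Figure 4 caption can be illustrated with a minimal NumPy sketch: a feature map is transformed to the frequency domain, reweighted per frequency, transformed back, and blended with the untouched spatial features. The internal layout of the actual Dual-Domain Fusion block is not described in this summary, so the function names, the `freq_weight` tensor (learned in a real model, fixed here), and the blending factor `alpha` are assumptions.

```python
import numpy as np


def frequency_branch(x: np.ndarray, freq_weight: np.ndarray) -> np.ndarray:
    """FFT -> per-frequency reweighting -> IFFT over the last two (H, W) axes."""
    spec = np.fft.fft2(x, axes=(-2, -1))
    spec = spec * freq_weight  # hypothetical stand-in for a learned frequency filter
    # Keep the real part; the imaginary part is discarded after the inverse transform.
    return np.fft.ifft2(spec, axes=(-2, -1)).real


def dual_domain_fuse(x: np.ndarray, freq_weight: np.ndarray,
                     alpha: float = 0.5) -> np.ndarray:
    """Blend the spatial features with the frequency-filtered features.

    Assumption: a simple convex combination; the paper's fusion may differ.
    """
    spatial = x  # identity spatial branch, for illustration only
    freq = frequency_branch(x, freq_weight)
    return alpha * spatial + (1.0 - alpha) * freq


# Usage: an all-ones frequency weight leaves a (C, H, W) feature map unchanged.
features = np.random.default_rng(0).standard_normal((2, 8, 8))
fused = dual_domain_fuse(features, np.ones((8, 8)))
```

Because a pointwise multiplication in the frequency domain acts on the whole spatial extent at once, a branch of this kind can mix global context into high-resolution features without the cost of full self-attention, which is consistent with the low-overhead claim in the abstract.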