
Small Object Detection by DETR via Information Augmentation and Adaptive Feature Fusion

Ji Huang, Hui Wang

TL;DR

The paper tackles accurate small object detection under real-time constraints by identifying two limitations of RT-DETR: reliance on high-level semantic features alone and simplistic multi-scale feature fusion. It introduces Fine-Grained Path Augmentation, which injects low-level detail into the Transformer input, and Adaptive Feature Fusion, which learns how to fuse multi-scale representations, aiming to improve small-object localization without sacrificing speed. Empirical results on the Aquarium Object Detection Dataset show that the proposed approach achieves strong overall performance ($AP^{val}$, $AP^{50}$, $AP^{75}$) and notably improves small-object accuracy ($AP^{S}$) over RT-DETR while remaining feasible for real-time use. The work demonstrates practical gains for real-time small object detection in challenging environments and provides a straightforward extension to existing DETR-based pipelines.

Abstract

The main challenge for small object detection algorithms is to maintain accuracy while achieving real-time performance. The RT-DETR model performs well in real-time object detection but poorly on small objects. To compensate for this shortcoming, this study proposes two key improvements. First, RT-DETR uses a Transformer that receives input solely from the final layer of backbone features. The Transformer's input therefore carries only semantic information from the highest level of abstraction in the deep network and ignores the detailed information at lower levels of abstraction, such as edges, texture, or color gradients, that is critical for locating small objects. Including only deep features can also introduce additional background noise, which can reduce small object detection accuracy. To address this issue, we propose the fine-grained path augmentation method, which helps locate small objects more accurately by providing detailed information to the deep network, so that the input to the Transformer contains both semantic and detailed information. Second, in RT-DETR the decoder takes feature maps from different levels as input after concatenating them with equal weight. This operation does not handle well the complex relationships among multi-scale information captured by feature maps of different sizes. We therefore propose an adaptive feature fusion algorithm that assigns learnable parameters to the feature map at each level. This allows the model to fuse feature maps from different levels adaptively and to integrate feature information from different scales effectively, which enhances its ability to capture object features at different scales and thereby improves small object detection accuracy.
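
The contrast between equal-weight concatenation and learnable per-level weights can be illustrated with a small framework-free sketch. It is not the authors' implementation: the abstract only says that each level receives learnable parameters, so the softmax normalization, the rescaling by the number of levels, and the names `softmax` and `adaptive_concat` are all assumptions made for illustration. Each level is represented as a list of token vectors (a flattened feature map), and the per-level logits stand in for parameters that training would update.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_concat(levels, logits):
    """Scale each level's tokens by a per-level weight, then concatenate.

    levels: one entry per feature level; each entry is a list of token
            vectors (lists of floats). Levels may have different token counts.
    logits: one scalar per level. In a real model these would be learnable
            parameters; here they are plain floats (assumption).
    """
    if len(levels) != len(logits):
        raise ValueError("need exactly one logit per feature level")
    weights = softmax(logits)
    n = len(levels)
    fused = []
    for w, tokens in zip(weights, levels):
        # Rescale by n so that equal logits give scale 1.0, which reproduces
        # plain equal-weight concatenation (assumed design choice).
        scale = w * n
        for tok in tokens:
            fused.append([scale * v for v in tok])
    return fused


if __name__ == "__main__":
    shallow = [[1.0, 2.0]]            # one token from a hypothetical level
    deep = [[3.0, 4.0], [5.0, 6.0]]   # two tokens from another level
    print(adaptive_concat([shallow, deep], [0.0, 0.0]))
    print(adaptive_concat([shallow, deep], [math.log(3.0), 0.0]))
```

With equal logits the output matches ordinary concatenation, so the baseline behavior is a special case. Unequal logits let one level contribute more strongly than another, which is the degree of freedom the abstract attributes to adaptive feature fusion.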

Paper Structure

This paper contains 14 sections, 2 equations, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: Overview of the model. FGPA denotes Fine-Grained Path Augmentation; AFU denotes Adaptive Feature Fusion.