Cross-DINO: Cross the Deep MLP and Transformer for Small Object Detection

Guiping Cao, Wenjian Huang, Xiangyuan Lan, Jianguo Zhang, Dongmei Jiang, Yaowei Wang

TL;DR

Cross-DINO tackles small object detection within DETR-like architectures by integrating a deep MLP backbone (via CLAP-Strip-MLP) to enrich initial features with both short- and long-range context, and by introducing a Cross Coding Twice Module (CCTM) to progressively fuse backbone and encoder details for finer object representation. Its Boost Loss uses a Category-Size soft label to modulate classification penalties, explicitly boosting small-object predictions. Across COCO, WiderPerson, VisDrone, AI-TOD, and SODA-D, Cross-DINO delivers consistent SOD gains with modest parameter counts and efficient training (12 epochs), notably achieving 36.4% AP_S on COCO with 45M parameters. These results demonstrate effective cross-domain enhancement of DETR-like detectors for small objects through architectural innovation and size-aware supervision.

Abstract

Small Object Detection (SOD) poses significant challenges due to the limited information available and the model's low class prediction scores. While Transformer-based detectors have shown promising performance, their potential for SOD remains largely unexplored. In typical DETR-like frameworks, the CNN backbone network, specialized in aggregating local information, struggles to capture the contextual information needed for SOD. The multiple attention layers in the Transformer Encoder have difficulty attending to small objects and can also blur features. Furthermore, the model's lower class prediction scores for small objects compared to large objects further increase the difficulty of SOD. To address these challenges, we introduce a novel approach called Cross-DINO. This approach incorporates a deep MLP network to aggregate initial feature representations with both short- and long-range information for SOD. Then, a new Cross Coding Twice Module (CCTM) is applied to integrate these initial representations into the Transformer Encoder features, enhancing the details of small objects. Additionally, we introduce a new kind of soft label named Category-Size (CS), which integrates the category and size of objects. By treating CS as the new ground truth, we propose a new loss function called Boost Loss to improve the class prediction score of the model. Extensive experimental results on the COCO, WiderPerson, VisDrone, AI-TOD, and SODA-D datasets demonstrate that Cross-DINO efficiently improves the performance of DETR-like models on SOD. Specifically, our model achieves 36.4% AP$_S$ on COCO for SOD with only 45M parameters, outperforming DINO by +4.4% AP$_S$ (36.4% vs. 32.0%) with fewer parameters and FLOPs, under a 12-epoch training setting. The source code will be available at https://github.com/Med-Process/Cross-DINO.

Paper Structure

This paper contains 33 sections, 9 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Comparison of AP and AP$_S$ for different models w.r.t. the number of training epochs on COCO val2017. Compared with popular CNN-based and Transformer-based models, Cross-DINO achieves higher AP$_S$ for SOD and higher AP for general object detection.
  • Figure 2: Challenges of SOD. (a) In contrast to large objects, small objects often receive lower class prediction scores. (b) Objects become smaller when down-sampling occurs. Down-sampling reduces the image's spatial resolution and blurs the image, making it difficult to capture the fine details of small objects.
  • Figure 3: The average class prediction scores of DINO-4scale [zhang2022dino] and the Cross-DINO model with Boost Loss for different object sizes on COCO val2017 [lin2014microsoft]. Note that Cross-DINO detects more 'harder' small objects than DINO. These 'harder' objects are detected with lower scores, which decreases the average confidence score. For a fair comparison, the statistics for both models are computed over detections with a confidence score of 0.4 or higher.
  • Figure 4: The overall architecture of the proposed Cross-DINO. Cross-DINO feeds 4-scale backbone feature maps to the decoder to ensure a fair comparison among all models in this study. The red solid and dashed lines highlight the differences between our model and DETR-like models: the new deep MLP backbone for enhancing initial representations, the novel CCTM module for strengthening object details, and the Boost Loss designed to improve detection capabilities, particularly for small objects.
  • Figure 5: The architecture of Cross Coding Twice Module.
  • ...and 6 more figures
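
The statistic behind Figure 3 (average class prediction score per object size, counting only detections scored 0.4 or higher) can be sketched as below. This is a minimal illustration, not the authors' evaluation code. The caption does not say how object sizes are binned, so the sketch assumes the standard COCO area buckets (small below 32² pixels, medium below 96² pixels, large otherwise) with simplified boundary handling. The names `size_bucket`, `mean_score_by_size`, and the `(score, box_area)` input format are hypothetical.

```python
from collections import defaultdict

# Assumed binning: standard COCO area thresholds in pixels^2.
SMALL_MAX_AREA = 32 ** 2
MEDIUM_MAX_AREA = 96 ** 2


def size_bucket(area):
    """Map a box area in pixels^2 to a COCO-style size bucket."""
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def mean_score_by_size(detections, score_threshold=0.4):
    """Average class score per size bucket over detections at or above the threshold.

    `detections` is a hypothetical list of (score, box_area) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for score, area in detections:
        if score < score_threshold:
            continue  # drop low-confidence detections, as in the Figure 3 setup
        bucket = size_bucket(area)
        totals[bucket] += score
        counts[bucket] += 1
    # Buckets with no surviving detections are omitted from the result.
    return {bucket: totals[bucket] / counts[bucket] for bucket in counts}


# Toy, made-up detections purely for illustration (not results from the paper).
toy = [(0.9, 20000.0), (0.5, 400.0), (0.7, 900.0), (0.3, 100.0), (0.8, 5000.0)]
print(mean_score_by_size(toy))
```

Because low-scoring detections are filtered out before averaging, a model that recovers many hard small objects just above the threshold can show a lower mean score in the small bucket while still detecting more objects, which is the effect the Figure 3 caption describes.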