DEIM: DETR with Improved Matching for Fast Convergence

Shihua Huang, Zhichao Lu, Xiaodong Cun, Yongjun Yu, Xiao Zhou, Xi Shen

TL;DR

DEIM tackles the slow convergence of DETR-based real-time object detectors with two components: Dense O2O, which increases the number of supervised targets per image through mosaic and mixup augmentation, and MAL, a loss that accounts for match quality. Together they accelerate training and improve detection accuracy without the extra compute of additional decoders. On COCO, DEIM converges faster, reaches higher AP, and performs strongly on small objects at unchanged latency; it also generalizes to datasets such as CrowdHuman and to backbone-augmented DETRs. The approach sets a new baseline for real-time DETR-based detectors, and its code and pre-trained models are publicly released.

Abstract

We introduce DEIM, an innovative and efficient training framework designed to accelerate convergence in real-time object detection with Transformer-based architectures (DETR). To mitigate the sparse supervision inherent in one-to-one (O2O) matching in DETR models, DEIM employs a Dense O2O matching strategy. This approach increases the number of positive samples per image by incorporating additional targets, using standard data augmentation techniques. While Dense O2O matching speeds up convergence, it also introduces numerous low-quality matches that could affect performance. To address this, we propose the Matchability-Aware Loss (MAL), a novel loss function that optimizes matches across various quality levels, enhancing the effectiveness of Dense O2O. Extensive experiments on the COCO dataset validate the efficacy of DEIM. When integrated with RT-DETR and D-FINE, it consistently boosts performance while reducing training time by 50%. Notably, paired with RT-DETRv2, DEIM achieves 53.2% AP in a single day of training on an NVIDIA 4090 GPU. Additionally, DEIM-trained real-time models outperform leading real-time object detectors, with DEIM-D-FINE-L and DEIM-D-FINE-X achieving 54.7% and 56.5% AP at 124 and 78 FPS on an NVIDIA T4 GPU, respectively, without the need for additional data. We believe DEIM sets a new baseline for advancements in real-time object detection. Our code and pre-trained models are available at https://github.com/ShihuaHuang95/DEIM.
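
The idea behind Dense O2O, adding targets to each training image through standard augmentation, can be illustrated with a toy mixup. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Sample` container, the flattened-pixel image representation, the `mixup` and `num_o2o_positives` helpers, and the 300-query budget are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), toy coordinates


@dataclass
class Sample:
    pixels: List[float]  # flattened image (hypothetical toy representation)
    boxes: List[Box]     # ground-truth boxes
    labels: List[int]    # one class id per box


def mixup(a: Sample, b: Sample, lam: float = 0.5) -> Sample:
    """Blend two same-sized images and keep the targets of both."""
    if len(a.pixels) != len(b.pixels):
        raise ValueError("images must have the same size")
    pixels = [lam * pa + (1.0 - lam) * pb for pa, pb in zip(a.pixels, b.pixels)]
    return Sample(pixels, a.boxes + b.boxes, a.labels + b.labels)


def num_o2o_positives(sample: Sample, num_queries: int = 300) -> int:
    """One-to-one matching assigns at most one query to each target,
    so the positive count is bounded by the number of targets.
    The default of 300 queries is an assumption, not taken from the text."""
    return min(len(sample.boxes), num_queries)


img_a = Sample([0.0] * 16, [(0, 0, 2, 2)], [1])
img_b = Sample([1.0] * 16, [(1, 1, 3, 3), (0, 2, 2, 4)], [2, 3])
mixed = mixup(img_a, img_b)

print(num_o2o_positives(img_a), num_o2o_positives(img_b), num_o2o_positives(mixed))
```

The matching scheme itself is unchanged: each target still receives a single query, but the blended image carries the targets of both source images, so the number of positives per image grows.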

Paper Structure

This paper contains 34 sections, 4 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 0: Comparison with state-of-the-art real-time object detectors on COCO \cite{lin2014microsoft}. The proposed DEIM achieves faster convergence (a) and superior performance in terms of average precision (AP) and latency (b) when compared to state-of-the-art real-time object detectors.
  • Figure 1: An illustration of our proposed DEIM. Yellow, red, and green boxes represent the GT, positive and negative samples, respectively. 'pos.' denotes the positive samples. Top: Our Dense O2O (Fig. \ref{fig:toy_Dense_O2O}) can provide the same quality of positive samples as O2M (Fig. \ref{fig:toy_O2M}). Bottom: For the low-quality matching, its loss values when using VFL \cite{zhang2021varifocalnet} and MAL are marked by $\star$, indicating MAL can optimize those cases more effectively.
  • Figure 2: Anchor/Query Match Comparison. Comparison of the number of matched anchors/queries per image in one COCO epoch using one-to-many (SimOTA \cite{zheng2021yolox}) and one-to-one (Hungarian \cite{carion2020end}) matching schemes.
  • Figure 3: VFL vs. MAL Comparison. Comparison of VFL and our MAL for low-quality (IoU = 0.05, Fig. \ref{fig:low_quality}) and high-quality (IoU = 0.95, Fig. \ref{fig:high_quality}) matching cases.
  • Figure 4: An illustration of our proposed training scheme, showing the learning rate and data augmentation schedulers.
  • ...and 2 more figures
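
The VFL vs. MAL comparison described in the Figure 3 caption can be approximated numerically. The sketch below rests on loud assumptions: the VFL expression follows the commonly cited VarifocalNet form, and the MAL expression (binary cross-entropy against the soft target q^γ, with γ = 1.5) is an assumed form that is not stated anywhere in this text and should be checked against the paper's equations. The prediction confidence `p = 0.9` and all function names are hypothetical.

```python
import math


def vfl(p: float, q: float, alpha: float = 0.75, gamma: float = 2.0) -> float:
    """Varifocal loss for one prediction.
    p: predicted score in (0, 1); q: IoU target (0 for negatives).
    alpha and gamma are commonly used defaults (assumed here)."""
    if q > 0:
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    return -alpha * (p ** gamma) * math.log(1.0 - p)


def mal(p: float, q: float, positive: bool, gamma: float = 1.5) -> float:
    """ASSUMED form of the Matchability-Aware Loss (not taken from this text):
    positives use cross-entropy against the soft target q**gamma,
    negatives use a p**gamma-weighted log-loss."""
    if positive:
        t = q ** gamma
        return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return -(p ** gamma) * math.log(1.0 - p)


p = 0.9  # hypothetical: a confident prediction on a matched query

# Low-quality match (IoU = 0.05) and high-quality match (IoU = 0.95).
low_vfl, low_mal = vfl(p, 0.05), mal(p, 0.05, positive=True)
high_vfl, high_mal = vfl(p, 0.95), mal(p, 0.95, positive=True)

print(f"IoU=0.05  VFL={low_vfl:.3f}  MAL={low_mal:.3f}")
print(f"IoU=0.95  VFL={high_vfl:.3f}  MAL={high_mal:.3f}")
```

Under these assumptions, VFL scales the positive term by q and therefore nearly vanishes for a low-IoU match, while the assumed MAL form still penalizes an overconfident prediction on that match. For the high-IoU case the two losses stay close.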