RT-DETRv3: Real-time End-to-End Object Detection with Hierarchical Dense Positive Supervision

Shuo Wang, Chunlong Xia, Feng Lv, Yifeng Shi

TL;DR

RT-DETRv3 addresses the sparse supervision issue in real-time end-to-end transformer detectors by introducing hierarchical dense positive supervision during training. It combines a CNN-based one-to-many auxiliary branch, multi-group self-attention perturbation in the transformer decoder, and a shared-weight one-to-many dense supervision branch, all training-only to preserve inference speed. On COCO val2017, RT-DETRv3 achieves state-of-the-art real-time accuracy, e.g., 48.1 AP for RT-DETRv3-R18 and 54.6 AP for RT-DETRv3-R101, while maintaining comparable latency to RT-DETR baselines. The approach accelerates convergence and delivers notable performance gains without adding inference overhead, making it practical for real-time applications.

Abstract

RT-DETR is the first real-time end-to-end transformer-based object detector. Its efficiency comes from the framework design and the Hungarian matching. However, compared to dense supervision detectors like the YOLO series, the Hungarian matching provides much sparser supervision, which leads to insufficient model training and makes optimal results difficult to achieve. To address these issues, we propose a hierarchical dense positive supervision method based on RT-DETR, named RT-DETRv3. Firstly, we introduce a CNN-based auxiliary branch that provides dense supervision and collaborates with the original decoder to enhance the encoder feature representation. Secondly, to address insufficient decoder training, we propose a novel learning strategy involving self-attention perturbation. This strategy diversifies label assignment for positive samples across multiple query groups, thereby enriching positive supervision. Additionally, we introduce a shared-weight decoder branch for dense positive supervision to ensure that more high-quality queries match each ground truth. Notably, all aforementioned modules are training-only. We conduct extensive experiments to demonstrate the effectiveness of our approach on COCO val2017. RT-DETRv3 significantly outperforms existing real-time detectors, including the RT-DETR series and the YOLO series. For example, RT-DETRv3-R18 achieves 48.1% AP (+1.6%/+1.4%) compared to RT-DETR-R18/RT-DETRv2-R18, while maintaining the same latency. Furthermore, RT-DETRv3-R101 attains 54.6% AP, outperforming YOLOv10-X. The code will be released at https://github.com/clxia12/RT-DETRv3.
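To make the sparse-versus-dense contrast concrete, the following is a minimal sketch, not the authors' implementation. It compares one-to-one matching (a brute-force minimum-cost assignment standing in for the Hungarian algorithm) with a simple top-k one-to-many assignment. The cost matrix, the function names, and the top-k rule are all illustrative assumptions; real assigners also resolve cases where one query is claimed by several ground truths.

```python
from itertools import permutations


def one_to_one_match(cost):
    """Return {gt_index: query_index} with minimal total cost.

    cost[q][g] is the (hypothetical) matching cost between query q and
    ground truth g. Brute force over assignments; this stands in for the
    Hungarian algorithm and is only practical for tiny examples.
    """
    num_queries, num_gt = len(cost), len(cost[0])
    best_total, best_assign = float("inf"), None
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best_total, best_assign = total, queries
    return {g: q for g, q in enumerate(best_assign)}


def one_to_many_match(cost, k):
    """Return {gt_index: [query_index, ...]} using the k cheapest queries.

    Simplification: conflicts (one query picked by two ground truths)
    are not resolved here.
    """
    num_queries, num_gt = len(cost), len(cost[0])
    return {
        g: sorted(range(num_queries), key=lambda q: cost[q][g])[:k]
        for g in range(num_gt)
    }


# Illustrative costs for 6 queries and 2 ground-truth boxes (made-up values).
cost = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.9, 0.2],
    [0.6, 0.6],
    [0.8, 0.7],
]

sparse = one_to_one_match(cost)
dense = one_to_many_match(cost, k=2)
print("one-to-one positives:", len(sparse))
print("one-to-many positives:", sum(len(qs) for qs in dense.values()))
```

With one-to-one matching each ground truth supervises exactly one query, so the number of positive samples equals the number of objects; the one-to-many rule multiplies that count by k, which is the sense in which the supervision becomes denser.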

Paper Structure (19 sections, 5 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Comparison with other real-time object detectors. Our method achieves a better trade-off between speed and accuracy. $*$ denotes the use of extra data.
  • Figure 2: Architecture of RT-DETRv3. We preserve the core architecture of RT-DETR (highlighted in yellow) and propose a novel hierarchical decoupled dense supervision method (emphasized in green). Firstly, we enhance the encoder's representation capability by incorporating a CNN-based one-to-many label assignment auxiliary branch. Secondly, to strengthen supervision of the decoder, we generate multiple groups of object queries (OQ) through the query selection module and apply random masking to perturb the self-attention mechanism, effectively diversifying the distribution of positive query samples. Additionally, to ensure that multiple relevant queries focus on the same target, we introduce a supplementary one-to-many matching branch. Notably, these auxiliary branches are discarded during evaluation.
  • Figure 3: Mask self-attention module. $M_{i}$ represents the perturbation mask corresponding to the $i$-th set of object queries. $\otimes$ denotes matrix multiplication, and $\odot$ denotes element-wise multiplication.
  • Figure 4: Convergence curves of RT-DETRv3 across different model sizes. $\star$ represents the best AP.
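
The mask self-attention described in the Figure 2 and Figure 3 captions can be sketched roughly as follows. This is a simplified, single-head illustration under several assumptions that the captions do not confirm: the binary mask $M_{i}$ is multiplied element-wise onto the softmax attention weights without renormalization, the query/key/value projections are replaced by the identity, and the diagonal of each mask is kept at 1. The names `random_mask`, `masked_self_attention`, and `drop_prob` are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def random_mask(n, drop_prob, rng):
    """Binary n x n perturbation mask; a 0 entry suppresses that attention weight.

    Assumption: the diagonal stays 1 so each query still attends to itself.
    """
    return [
        [1.0 if i == j or rng.random() >= drop_prob else 0.0 for j in range(n)]
        for i in range(n)
    ]


def masked_self_attention(queries, mask):
    """Self-attention whose weights are perturbed by an element-wise mask.

    Assumption: identity Q/K/V projections, single head.
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(queries))                 # matrix product
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    perturbed = [
        [w * m for w, m in zip(w_row, m_row)]                    # element-wise product
        for w_row, m_row in zip(weights, mask)
    ]
    return matmul(perturbed, queries)                            # matrix product


# Three query groups share the same toy queries but get different masks.
queries = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.5, 0.0],
    [1.0, 1.0, 1.0],
]
rng = random.Random(0)
group_masks = [random_mask(len(queries), drop_prob=0.3, rng=rng) for _ in range(3)]
group_outputs = [masked_self_attention(queries, m) for m in group_masks]
for i, out in enumerate(group_outputs):
    print(f"group {i}:", [[round(v, 3) for v in row] for row in out])
```

Because each group sees a different mask, the same initial queries produce different fused features, which is the mechanism by which the groups can end up matched to ground truths differently during training. At inference the perturbation is not used, consistent with the caption's note that the auxiliary branches are discarded.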