RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer

Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, Yi Liu

TL;DR

RT-DETRv2 improves on RT-DETR by introducing scale-aware sampling in deformable attention (a distinct number of sampling points per feature scale), plus an optional discrete sampling operator that replaces grid_sample to ease deployment. It also introduces two training strategies, dynamic data augmentation and scale-adaptive hyperparameters, that boost accuracy without sacrificing speed. On COCO, RT-DETRv2 consistently improves AP over RT-DETR across multiple backbone sizes while maintaining real-time throughput. Together, the architectural freebies and the revised training recipe offer a more flexible, practical baseline for real-time detection transformers.

Abstract

In this report, we present RT-DETRv2, an improved Real-Time DEtection TRansformer (RT-DETR). RT-DETRv2 builds upon the previous state-of-the-art real-time detector, RT-DETR, and introduces a bag-of-freebies for flexibility and practicality, together with an optimized training strategy for better performance. To improve flexibility, we suggest setting a distinct number of sampling points for features at different scales in the deformable attention, so that the decoder performs selective multi-scale feature extraction. To enhance practicality, we propose an optional discrete sampling operator to replace the grid_sample operator, which RT-DETR requires but YOLOs do not. This removes the deployment constraints typically associated with DETRs. For the training strategy, we propose dynamic data augmentation and scale-adaptive hyperparameter customization to improve performance without loss of speed. Source code and pre-trained models will be available at https://github.com/lyuwenyu/RT-DETR.
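
The two architectural ideas in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation: it assumes that the discrete sampling operator amounts to rounding each sampling location to the nearest pixel (an index lookup with no interpolation), and it models grid_sample-style sampling as plain bilinear interpolation with zero padding. The function names, the normalized-coordinate convention, and the per-level point counts are all hypothetical and chosen only for illustration.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinear interpolation at continuous pixel coords (x, y).

    Stand-in for grid_sample-style sampling; pixels outside the map count as 0.
    """
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    out = 0.0
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < h and 0 <= xi < w:
                out += wy * wx * feat[yi][xi]
    return out


def discrete_sample(feat, x, y):
    """Assumed discrete variant: round to the nearest pixel, clamp, and index."""
    h, w = len(feat), len(feat[0])
    xi = min(max(math.floor(x + 0.5), 0), w - 1)
    yi = min(max(math.floor(y + 0.5), 0), h - 1)
    return feat[yi][xi]


def aggregate(levels, points, weights, sampler):
    """Weighted sum of sampled values over feature levels.

    points[l] may have a different length for each level l, which is the
    "distinct number of sampling points per scale" idea. Coordinates are
    normalized to [0, 1] (a hypothetical convention for this sketch).
    """
    total = 0.0
    for feat, pts, ws in zip(levels, points, weights):
        h, w = len(feat), len(feat[0])
        for (nx, ny), a in zip(pts, ws):
            total += a * sampler(feat, nx * (w - 1), ny * (h - 1))
    return total


# Two toy feature levels: a 4x4 map and a 2x2 map.
fine = [[4 * r + c for c in range(4)] for r in range(4)]
coarse = [[0, 10], [20, 30]]

# Hypothetical allocation: two points on the fine level, one on the coarse level.
points = [[(0.0, 0.0), (1.0, 1.0)], [(1.0, 0.0)]]
weights = [[0.25, 0.25], [0.5]]

smooth = aggregate([fine, coarse], points, weights, bilinear_sample)
rounded = aggregate([fine, coarse], points, weights, discrete_sample)
```

In this toy setup every sampling location falls exactly on a pixel, so both samplers return the same aggregate. At fractional locations they differ: the bilinear version blends up to four neighbours, while the rounded version reads a single pixel, which is what would let it avoid an interpolation operator at deployment time under the assumption stated above.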

Paper Structure (10 sections, 4 tables)