RTMDet: An Empirical Study of Designing Real-Time Object Detectors

Chengqi Lyu; Wenwei Zhang; Haian Huang; Yue Zhou; Yudong Wang; Yanyi Liu; Shilong Zhang; Kai Chen

TL;DR

RTMDet is a real-time object detector that achieves a state-of-the-art speed-accuracy trade-off by using large-kernel depth-wise convolutions and a dynamic label assignment scheme whose matching cost is computed with soft labels. The design balances backbone and neck capacity, adopts a shared detection head, and combines cached Mosaic and MixUp data augmentation with a two-stage training schedule. It reaches 52.8% AP on COCO at 300+ FPS on an NVIDIA 3090 GPU and extends to instance segmentation and rotated object detection with modest architectural additions. Overall, the work offers practical design principles for scalable real-time detection across multiple recognition tasks.

Abstract

In this paper, we aim to design an efficient real-time object detector that exceeds the YOLO series and is easily extensible for many object recognition tasks such as instance segmentation and rotated object detection. To obtain a more efficient model architecture, we explore an architecture that has compatible capacities in the backbone and neck, constructed by a basic building block that consists of large-kernel depth-wise convolutions. We further introduce soft labels when calculating matching costs in the dynamic label assignment to improve accuracy. Together with better training techniques, the resulting object detector, named RTMDet, achieves 52.8% AP on COCO with 300+ FPS on an NVIDIA 3090 GPU, outperforming the current mainstream industrial detectors. RTMDet achieves the best parameter-accuracy trade-off with tiny/small/medium/large/extra-large model sizes for various application scenarios, and obtains new state-of-the-art performance on real-time instance segmentation and rotated object detection. We hope the experimental results can provide new insights into designing versatile real-time object detectors for many object recognition tasks. Code and models are released at https://github.com/open-mmlab/mmdetection/tree/3.x/configs/rtmdet.

Paper Structure (43 sections, 4 equations, 4 figures, 22 tables)

Introduction
Related Work
Instance segmentation.
Rotated object detection.
Methodology
Macro Architecture
Model Architecture
Basic building block.
Balance of model width and depth.
Balance of backbone and neck.
Shared detection head.
Training Strategy
Label assignment and losses.
Cached Mosaic and MixUp.
Two-stage training.
...and 28 more sections

Figures (4)

Figure 1: Comparison of parameter and accuracy. (a) Comparison of RTMDet and other state-of-the-art real-time object detectors. (b) Comparison of RTMDet-Ins and other one-stage instance segmentation methods.
Figure 2: Macro architecture. We use CSP blocks (CSPNet) with large-kernel depth-wise convolution layers to build the backbone. The multi-level features, denoted as $C3$, $C4$, and $C5$, are extracted from the backbone and then fused in the CSP-PAFPN, which consists of the same block as the backbone. Then, detection heads with shared convolution weights and separated batch normalization (BN) layers are used to predict the classification and regression results for (rotated) bounding box detection. Extra heads can be added to produce dynamic convolution kernels and mask features for the instance segmentation task.
Figure 3: Different basic building blocks. (a) The basic bottleneck block of DarkNet used in YOLOv4, YOLOv3, YOLOX, and YOLOv5. (b) The proposed bottleneck block with a large-kernel depth-wise convolution layer. (c) Bottleneck block of PPYOLO-E that uses re-parameterized convolution. (d) The basic unit of YOLOv6.
Figure 4: Instance segmentation branch in RTMDet-Ins. The mask feature head has 4 convolution layers and predicts mask features of 8 channels (CondInst) from the multi-level features extracted from the neck. Two relative coordinate features are concatenated with the mask features to generate instance masks. The kernel head predicts a 169-dimensional vector for each instance. The vector is divided into three parts (of lengths 88, 72, and 9, respectively), which are used to form the kernels of three dynamic convolution layers.
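
The 88/72/9 split in the Figure 4 caption can be checked with a little arithmetic. The sketch below assumes, in the style of CondInst, that the three dynamic layers are 1x1 convolutions with biases and channel widths 10 -> 8 -> 8 -> 1, where 10 is the 8 mask channels plus the 2 relative coordinate channels. These widths, the 1x1 kernel size, the weights-then-bias ordering within each part, and the ReLU between layers are assumptions chosen to match the caption's numbers, not details confirmed in this section. The function names are hypothetical.

```python
import numpy as np

MASK_CHANNELS = 8    # from the Figure 4 caption
COORD_CHANNELS = 2   # from the Figure 4 caption
HIDDEN = 8           # assumed width of the dynamic layers
SPLIT = (88, 72, 9)  # from the Figure 4 caption


def layer_shapes():
    """(in_channels, out_channels) of the three assumed 1x1 dynamic convs."""
    c_in = MASK_CHANNELS + COORD_CHANNELS
    return [(c_in, HIDDEN), (HIDDEN, HIDDEN), (HIDDEN, 1)]


def split_kernel_vector(vec):
    """Split one instance's 169-d vector into (weight, bias) per layer."""
    vec = np.asarray(vec, dtype=np.float32)
    assert vec.shape == (sum(SPLIT),)
    layers, offset = [], 0
    for (c_in, c_out), length in zip(layer_shapes(), SPLIT):
        # weights + biases of a 1x1 conv must fill the part exactly
        assert c_in * c_out + c_out == length
        chunk = vec[offset:offset + length]
        weight = chunk[:c_in * c_out].reshape(c_out, c_in)
        bias = chunk[c_in * c_out:]
        layers.append((weight, bias))
        offset += length
    return layers


def apply_dynamic_convs(features, layers):
    """Apply the 1x1 convs to a (C, H, W) feature map; returns (1, H, W)."""
    x = features
    for i, (weight, bias) in enumerate(layers):
        x = np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]
        if i < len(layers) - 1:
            x = np.maximum(x, 0)  # assumed ReLU between dynamic layers
    return x
```

Under these assumptions the part lengths work out as 10*8 + 8 = 88, 8*8 + 8 = 72, and 8*1 + 1 = 9, which sum to the 169 dimensions predicted by the kernel head.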
