MS-DETR: Efficient DETR Training with Mixed Supervision
Chuyang Zhao, Yifan Sun, Wenhao Wang, Qiang Chen, Errui Ding, Yi Yang, Jingdong Wang
TL;DR
MS-DETR addresses DETR's training inefficiency by introducing mixed supervision that adds a one-to-many objective to the primary decoder alongside the standard one-to-one supervision. The method defines a one-to-many loss with top-$K_n$ matching and a combined matching score to enrich object query representations without adding decoder branches. Empirically, MS-DETR delivers consistent gains across multiple DETR baselines, accelerates convergence, and remains memory-efficient, with notable improvements in both object detection and instance segmentation. The approach is complementary to existing one-to-many DETR variants and yields practical benefits for training efficiency and detection quality.
Abstract
DETR accomplishes end-to-end object detection by iteratively generating multiple object candidates from image features and promoting one candidate for each ground-truth object. The traditional training procedure in the original DETR uses one-to-one supervision, which lacks direct supervision for the object detection candidates. We aim to improve DETR training efficiency by explicitly supervising the candidate generation procedure through a mix of one-to-one and one-to-many supervision. Our approach, MS-DETR, is simple: it applies one-to-many supervision to the object queries of the primary decoder, which is the decoder used for inference. In contrast to existing DETR variants with one-to-many supervision, such as Group DETR and Hybrid DETR, our approach needs no additional decoder branches or object queries. The object queries of the primary decoder directly benefit from one-to-many supervision and are therefore better at object candidate prediction. Experimental results show that our approach outperforms related DETR variants, such as DN-DETR, Hybrid DETR, and Group DETR, and that combining it with these variants further improves performance.
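To make the idea of one-to-many supervision concrete, the following is a minimal sketch of top-K label assignment with a combined matching score, as mentioned in the TL;DR. It is an illustration under stated assumptions, not the authors' implementation: the weighted sum of classification score and IoU, the weight `alpha`, the value of `k`, and the function names `box_iou` and `one_to_many_assign` are all hypothetical choices made for this example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def one_to_many_assign(pred_boxes, pred_scores, gt_boxes, gt_labels, k=2, alpha=0.5):
    """For each ground-truth object, pick the k queries with the highest
    combined score. The combination used here (an assumption) is
    alpha * class_score + (1 - alpha) * IoU."""
    assignments = {}
    for g, (gt_box, gt_label) in enumerate(zip(gt_boxes, gt_labels)):
        scored = []
        for q, (box, scores) in enumerate(zip(pred_boxes, pred_scores)):
            match = alpha * scores[gt_label] + (1.0 - alpha) * box_iou(box, gt_box)
            scored.append((match, q))
        # Highest score first; ties are broken by the lower query index.
        scored.sort(key=lambda t: (-t[0], t[1]))
        assignments[g] = [q for _, q in scored[:k]]
    return assignments


# Toy example: four queries, two ground-truth objects, two classes.
pred_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (49, 49, 61, 61)]
pred_scores = [[0.9, 0.1], [0.6, 0.2], [0.1, 0.8], [0.2, 0.7]]
gt_boxes = [(0, 0, 10, 10), (50, 50, 60, 60)]
gt_labels = [0, 1]

assignments = one_to_many_assign(pred_boxes, pred_scores, gt_boxes, gt_labels, k=2)
print(assignments)
```

In this toy setup each ground-truth object receives two positive queries instead of the single query that one-to-one (Hungarian) matching would select. Note that this simple sketch lets one query be selected for more than one ground-truth object and applies no score threshold; a practical assignment rule would need to handle both cases.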
