MS-DETR: Efficient DETR Training with Mixed Supervision

Chuyang Zhao, Yifan Sun, Wenhao Wang, Qiang Chen, Errui Ding, Yi Yang, Jingdong Wang

TL;DR

MS-DETR addresses DETR's training inefficiency by introducing mixed supervision that adds a one-to-many objective to the primary decoder alongside the standard one-to-one supervision. The method defines a one-to-many loss with top-$K_n$ matching and a combined matching score to enrich object query representations without adding decoder branches. Empirically, MS-DETR delivers consistent gains across multiple DETR baselines, accelerates convergence, and remains memory-efficient, with notable improvements in both object detection and instance segmentation. The approach is complementary to existing one-to-many DETR variants and yields practical benefits for training efficiency and detection quality.

Abstract

DETR accomplishes end-to-end object detection by iteratively generating multiple object candidates from image features and promoting one candidate for each ground-truth object. The traditional training procedure in the original DETR, which uses one-to-one supervision, lacks direct supervision for the object detection candidates. We aim to improve DETR training efficiency by explicitly supervising candidate generation through a mix of one-to-one and one-to-many supervision. Our approach, MS-DETR, is simple: it places one-to-many supervision on the object queries of the primary decoder, which is the decoder used for inference. In comparison to existing DETR variants with one-to-many supervision, such as Group DETR and Hybrid DETR, our approach needs no additional decoder branches or object queries. The object queries of the primary decoder benefit directly from one-to-many supervision and are therefore better at object candidate prediction. Experimental results show that our approach outperforms related DETR variants, such as DN-DETR, Hybrid DETR, and Group DETR, and that combining it with these variants further improves performance.

Paper Structure (11 sections, 9 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Mixed supervision leads to better detection candidates. Top: ground-truth box. Middle: candidate boxes from top-$20$ queries with the baseline. Bottom: candidate boxes from top-$20$ queries with our MS-DETR. One can see that MS-DETR generates better detection candidates than the baseline.
  • Figure 2: Mixed supervision leads to lower one-to-one losses than the baseline. The $x$-axis corresponds to #epochs, and the $y$-axis corresponds to the training loss from one-to-one supervision. Dashed and solid lines correspond to the loss curve of the Deformable DETR baseline and the MS-DETR, respectively. Best viewed in color.
  • Figure 3: Architecture differences. (a) Original DETR, trained with one-to-one supervision. (b) Our MS-DETR, trained by mixing one-to-one and one-to-many supervision; both supervisions are imposed on the primary decoder. (c) Group DETR and DN-DETR. Additional parallel decoders are introduced, and one-to-one supervision is imposed on them. Group DETR and DN-DETR may use more than one additional decoder. (d) Hybrid DETR. An additional parallel decoder is added, and one-to-many supervision is imposed on it.
  • Figure 4: MS-DETR implementations. (a) One-to-one and one-to-many supervisions are applied to the output object queries of each decoder layer. (b) The two supervisions are applied to the output object queries of each decoder layer, with the layer slightly modified to perform cross-attention first and then self-attention. (c) and (d) The one-to-many supervisions are applied to the internal object queries. $\texttt{cls}_{\texttt{11}}$ and $\texttt{box}_{\texttt{11}}$ are the class and box predictors for one-to-one supervision, and $\texttt{cls}_{\texttt{1m}}$ and $\texttt{box}_{\texttt{1m}}$ are the class and box predictors for one-to-many supervision. The image features input to cross-attention are not depicted for clarity.
  • Figure 5: Influence of the hyper-parameters for one-to-many assignment. (a) Influence of $K$ for selecting top-$K$ positive queries, (b) Influence of the threshold $\tau$ to filter out low-quality queries, and (c) Influence of the matching score weight $\alpha$.
  • ...and 2 more figures
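
The hyper-parameters named in the Figure 5 caption ($K$, $\tau$, $\alpha$) suggest a one-to-many assignment along the following lines. This is a minimal sketch and not the authors' implementation. It assumes that the combined matching score is a linear mix of classification score and box IoU weighted by $\alpha$, and that the top-$K$ selection is applied before the threshold $\tau$; the text above states neither detail. The function names, the box format, and the numeric values are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def one_to_many_assign(
    scores: List[List[float]],  # scores[i][j]: query i's score for the class of ground truth j
    boxes: List[Box],           # predicted box of each query
    gt_boxes: List[Box],        # ground-truth boxes
    k: int,                     # K: number of top queries kept per ground truth
    tau: float,                 # tau: threshold that filters low-quality queries
    alpha: float,               # alpha: matching score weight
) -> List[List[int]]:
    """Return, for each ground-truth object, the indices of its positive queries."""
    positives = []
    for j, gt in enumerate(gt_boxes):
        # Assumed form of the combined matching score (not given in the text above).
        match = [
            (alpha * scores[i][j] + (1.0 - alpha) * iou(boxes[i], gt), i)
            for i in range(len(boxes))
        ]
        # Highest score first; ties are broken by the lower query index.
        match.sort(key=lambda t: (-t[0], t[1]))
        # Keep the top-K queries, then drop those below the threshold.
        positives.append([i for m, i in match[:k] if m >= tau])
    return positives


# Illustrative values only: four queries, one ground-truth object.
gt_boxes = [(0.0, 0.0, 10.0, 10.0)]
boxes = [
    (0.0, 0.0, 10.0, 10.0),    # exact overlap with the ground truth
    (0.0, 0.0, 10.0, 5.0),     # covers half of the ground truth
    (20.0, 20.0, 30.0, 30.0),  # no overlap
    (0.0, 0.0, 5.0, 5.0),      # covers a quarter of the ground truth
]
scores = [[0.9], [0.8], [0.9], [0.5]]
assigned = one_to_many_assign(scores, boxes, gt_boxes, k=3, tau=0.4, alpha=0.3)
```

Unlike one-to-one supervision, which promotes a single query per ground-truth object, a rule of this kind can mark several queries as positives for the same object, which is what lets the one-to-many loss supervise the candidates directly.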