Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment

Qiang Chen, Xiaokang Chen, Jian Wang, Shan Zhang, Kun Yao, Haocheng Feng, Junyu Han, Errui Ding, Gang Zeng, Jingdong Wang

TL;DR

This paper tackles the slow convergence of DETR by introducing Group DETR, which partitions object queries into K groups and performs one-to-one assignment within each group while sharing parameters across parallel decoders. The group-wise competition provides additional supervision and acts as automatic query augmentation, speeding training without altering inference. Empirical results show consistent improvements across a range of DETR variants, backbone scales, and tasks (including 3D detection and instance segmentation), with modest memory and compute overhead. The approach is simple to implement, generalizable, and preserves end-to-end detection, making it a practical acceleration for DETR-based pipelines.

Abstract

Detection transformer (DETR) relies on one-to-one assignment, which assigns each ground-truth object to a single prediction, to achieve end-to-end detection without NMS post-processing. One-to-many assignment, which assigns each ground-truth object to multiple predictions, is known to succeed in detection methods such as Faster R-CNN and FCOS. However, naive one-to-many assignment does not work for DETR, and applying one-to-many assignment to DETR training remains challenging. In this paper, we introduce Group DETR, a simple yet efficient DETR training approach that performs one-to-many assignment in a group-wise way. The approach uses multiple groups of object queries, conducts one-to-one assignment within each group, and performs decoder self-attention separately for each group. It resembles data augmentation with automatically learned object query augmentation. It is also equivalent to simultaneously training parameter-sharing networks of the same architecture, which introduces more supervision and thus improves DETR training. Inference is the same as for a normally trained DETR and needs only one group of queries, without any architecture modification. Group DETR is versatile and applicable to various DETR variants. Experiments show that Group DETR significantly speeds up training convergence and improves the performance of various DETR-based models. Code will be available at https://github.com/Atten4Vis/GroupDETR.
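
To make the group-wise assignment concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a precomputed matching-cost matrix and equally sized groups, and it uses a brute-force search as a stand-in for the Hungarian matching that DETR-style models typically use. The function names `one_to_one_assign` and `group_wise_assign` and the cost values are hypothetical.

```python
from itertools import permutations


def one_to_one_assign(cost):
    """Minimum-cost one-to-one matching by brute force (illustration only).

    cost[q][g] is the matching cost between query q and ground-truth object g.
    Returns {gt_index: query_index}. A real implementation would use the
    Hungarian algorithm; this search is only practical for tiny inputs.
    """
    num_queries = len(cost)
    num_gt = len(cost[0]) if cost else 0
    assert num_gt <= num_queries, "need at least one query per object"
    best, best_total = (), float("inf")
    # Try every ordered choice of num_gt distinct queries.
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best, best_total = queries, total
    return {g: q for g, q in enumerate(best)}


def group_wise_assign(cost, num_groups):
    """Split the queries into equal groups and match each group independently.

    Each ground-truth object receives one query per group, so it is matched
    to num_groups queries overall (one-to-many across groups).
    """
    num_queries = len(cost)
    assert num_queries % num_groups == 0, "assumes equally sized groups"
    size = num_queries // num_groups
    assignments = []
    for k in range(num_groups):
        start = k * size
        local = one_to_one_assign(cost[start:start + size])
        # Convert group-local query indices back to global indices.
        assignments.append({g: start + q for g, q in local.items()})
    return assignments


# Two groups of 4 queries and 2 ground-truth objects (made-up costs).
cost = [
    [0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.7, 0.6],  # group 0
    [0.9, 0.3], [0.4, 0.8], [0.2, 0.7], [0.6, 0.1],  # group 1
]
print(group_wise_assign(cost, num_groups=2))
# [{0: 0, 1: 1}, {0: 6, 1: 7}]
```

In this toy example each object is assigned one query in every group, and the remaining queries in each group would be matched to "no object", as in ordinary one-to-one DETR training.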

Paper Structure (23 sections, 7 equations, 10 figures, 9 tables, 1 algorithm)

This paper contains 23 sections, 7 equations, 10 figures, 9 tables, 1 algorithm.

Figures (10)

  • Figure 1: Group DETR accelerates the training process for DETR variants. The training convergence curves are obtained on COCO val2017 [lin2014coco] with ResNet-$50$ [he2016deep]. Dashed and bold curves correspond to the baseline models and the Group DETR counterparts. Best viewed in color.
  • Figure 2: Architecture illustration. (a) Our Group DETR: group-wise one-to-many assignment and separate self-attention, architecturally equivalent to parallel decoder. (b) Group-wise one-to-many assignment only. (c) Naive one-to-many assignment. We use two groups of $4$ object queries as an example. $\mathbf{X}$: image features; $\mathbf{Y}$: predictions; $\bar{\mathbf{Y}}$: ground-truth objects, where two color boxes mean two objects and two gray boxes mean dummy objects (no objects). The color lines between $\mathbf{Y}$ and $\bar{\mathbf{Y}}$ correspond to the assignment for ground-truth objects, and the gray lines for dummy objects. For clarity, the predictors are not explicitly included.
  • Figure 3: Illustrating object queries. The predicted boxes and reference points corresponding to object queries in different groups for the same ground-truth object are plotted in different colors, with one color per group. It can be seen that these queries are spatially close and can be viewed as augmentations of one another. The results are from Group DETR over Conditional DETR-R50 [meng2021conditional]. The predicted boxes and reference points may overlap. Best viewed in color and zoomed in.
  • Figure 4: The performance across groups of queries is similar. Only a deviation of $\pm 0.1$ mAP from the median ($37.5$ mAP) is observed. The mAP scores on COCO val2017 are reported for a Conditional DETR-R50 trained with Group DETR for $12$ epochs.
  • Figure 5: More stable assignment. The $x$-axis corresponds to the epoch number, and the $y$-axis corresponds to the instability score over COCO val2017 (the score is introduced by DN-DETR [li2022dn]; the lower the score, the more stable the label assignment). One can see that the assignment in Group DETR is more stable than in DN-DETR and its baseline DAB-DETR.
  • ...and 5 more figures
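
The separate self-attention described in the Figure 2 caption can be pictured as a block-diagonal attention mask over the concatenated query groups. The sketch below is an illustration under stated assumptions, not the paper's code: the helper names `group_self_attention_mask` and `select_inference_queries` are hypothetical, groups are assumed to be equally sized and stored contiguously, and `True` marks a blocked position (the convention used for boolean masks by `torch.nn.MultiheadAttention`). Keeping the first group at inference is also an assumption; the source only states that one group of queries is needed.

```python
def group_self_attention_mask(num_groups, queries_per_group):
    """Build a boolean mask where mask[i][j] is True if query i must NOT
    attend to query j, i.e. the two queries belong to different groups."""
    total = num_groups * queries_per_group
    return [
        [(i // queries_per_group) != (j // queries_per_group) for j in range(total)]
        for i in range(total)
    ]


def select_inference_queries(queries, queries_per_group):
    """Keep a single group of queries for inference (here: the first one)."""
    return queries[:queries_per_group]


# Two groups of 4 object queries, as in the Figure 2 example.
mask = group_self_attention_mask(num_groups=2, queries_per_group=4)
for row in mask:
    # '.' = attention allowed (same group), 'x' = attention blocked.
    print("".join("x" if blocked else "." for blocked in row))
```

With this mask, queries interact only inside their own group during training, which is what makes the setup behave like parallel decoders with shared parameters; at inference no mask is needed because only one group remains.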