DETRs with Collaborative Hybrid Assignments Training

Zhuofan Zong; Guanglu Song; Yu Liu

TL;DR

DETR-based detectors suffer from sparse encoder supervision and limited decoder attention due to one-to-one matching. The authors present Co-DETR, a collaborative hybrid assignments training scheme that employs multiple auxiliary heads supervised by diverse one-to-many label assignments and generates customized positive queries to strengthen both encoder learning and decoder cross-attention, while keeping inference cost unchanged. Across DETR variants and backbones, Co-DETR yields consistent AP gains and achieves state-of-the-art performance on COCO test-dev with ViT-L backbones and strong LVIS results, demonstrating scalable improvements with modest training overhead. This approach shows that combining complementary label-assignment strategies during training can substantially enhance end-to-end DETR detectors without adding inference-time complexity.

Abstract

In this paper, we observe that assigning too few queries as positive samples in DETR with one-to-one set matching leads to sparse supervision on the encoder's output, which considerably hurts the discriminative feature learning of the encoder, and vice versa for attention learning in the decoder. To alleviate this, we present a novel collaborative hybrid assignments training scheme, namely $\mathcal{C}$o-DETR, to learn more efficient and effective DETR-based detectors from versatile label assignment manners. This new training scheme can easily enhance the encoder's learning ability in end-to-end detectors by training multiple parallel auxiliary heads supervised by one-to-many label assignments such as ATSS and Faster RCNN. In addition, we construct extra customized positive queries by extracting the positive coordinates from these auxiliary heads to improve the training efficiency of positive samples in the decoder. In inference, these auxiliary heads are discarded, and thus our method introduces no additional parameters or computational cost to the original detector while requiring no hand-crafted non-maximum suppression (NMS). We conduct extensive experiments to evaluate the effectiveness of the proposed approach on DETR variants, including DAB-DETR, Deformable-DETR, and DINO-Deformable-DETR. The state-of-the-art DINO-Deformable-DETR with Swin-L can be improved from 58.5% to 59.5% AP on COCO val. Surprisingly, incorporated with a ViT-L backbone, we achieve 66.0% AP on COCO test-dev and 67.9% AP on LVIS val, outperforming previous methods by clear margins with much smaller model sizes. Code is available at \url{https://github.com/Sense-X/Co-DETR}.
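The contrast between one-to-one and one-to-many assignment described in the abstract can be illustrated with a small toy example. The following is a minimal sketch under stated assumptions, not the authors' implementation: the matching cost is plain IoU (actual DETR matching also uses classification and box-regression costs), the one-to-many rule is a fixed IoU threshold (a simplification; ATSS uses adaptive thresholds), and all function names and box values are hypothetical.

```python
from itertools import permutations

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def one_to_one_positives(preds, gts):
    """Brute-force bipartite matching: each ground truth gets exactly one
    prediction. Assumption: total IoU is the only matching score."""
    best_score, best = -1.0, ()
    for perm in permutations(range(len(preds)), len(gts)):
        score = sum(iou(preds[p], gts[g]) for g, p in enumerate(perm))
        if score > best_score:
            best_score, best = score, perm
    return set(best)

def one_to_many_positives(preds, gts, thr=0.5):
    """Simplified one-to-many rule: every prediction whose IoU with some
    ground truth reaches the threshold counts as positive."""
    return {i for i, p in enumerate(preds) if any(iou(p, g) >= thr for g in gts)}

def customized_positive_queries(preds, positive_idx):
    """Hypothetical helper: collect the positive box coordinates of an
    auxiliary head so they could serve as extra decoder queries."""
    return [preds[i] for i in sorted(positive_idx)]

# Toy data (made up for illustration): two objects, six candidate boxes.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [
    (0, 0, 10, 10), (1, 1, 10, 10), (0, 0, 9, 10),   # near the first object
    (20, 20, 30, 30), (21, 20, 30, 31),              # near the second object
    (50, 50, 60, 60),                                # background
]

sparse = one_to_one_positives(preds, gts)
dense = one_to_many_positives(preds, gts)
extra_queries = customized_positive_queries(preds, dense)
print(len(sparse), len(dense), len(extra_queries))
```

In this toy setting the one-to-one matcher marks only one box per object as positive, while the thresholded rule keeps every well-overlapping box, which is the denser supervision signal the auxiliary heads are meant to provide during training.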

Paper Structure (16 sections, 12 equations, 10 figures, 13 tables)

Introduction
Related Works
Method
Overview
Collaborative Hybrid Assignments Training
Customized Positive Queries Generation
Why Co-DETR works
Comparison with other methods
Experiments
Setup
Main Results
Comparisons with the state-of-the-art
Ablation Studies
Conclusions
More ablation studies
...and 1 more section

Figures (10)

Figure 1: Performance of models with ResNet-50 on COCO val. $\mathcal{C}$o-DETR outperforms other counterparts by a large margin.
Figure 2: IoF-IoB curves for the feature discriminability score in the encoder and attention discriminability score in the decoder.
Figure 3: Visualizations of discriminability scores in the encoder.
Figure 4: Framework of our Collaborative Hybrid Assignment Training. The auxiliary branches are discarded during evaluation.
Figure 5: The instability (IS) of Deformable-DETR and $\mathcal{C}$o-Deformable-DETR on the COCO dataset. These detectors are trained for 12 epochs with ResNet-50 backbones.
...and 5 more figures
