Align-DETR: Enhancing End-to-end Object Detection with Aligned Loss

Zhi Cai; Songtao Liu; Guodong Wang; Zheng Ge; Xiangyu Zhang; Di Huang

TL;DR

Align-DETR tackles two core misalignment issues in DETR: (1) misalignment between classification confidence and localization accuracy, and (2) misalignment of training targets across decoder layers. It introduces Align Loss, a regression-aware, soft-target loss, and a many-to-one, ranking-based matching strategy with exponential down-weighting to smoothly transition from positives to negatives and stabilize cross-layer supervision. Together, these components align classification and regression and stabilize intermediate targets, yielding significant improvements on COCO (e.g., 50.5 AP with the 1x schedule and 51.7 AP with the 2x schedule) and surpassing several strong DETR variants with fewer tricks. The approach is a simple enhancement to end-to-end object detection whose gains hold across training schedules and backbones.

Abstract

DETR has set up a simple end-to-end pipeline for object detection by formulating this task as a set prediction problem, showing promising potential. Despite its notable advancements, this paper identifies two key forms of misalignment within the model: classification-regression misalignment and cross-layer target misalignment. Both issues impede DETR's convergence and degrade its overall performance. To tackle both issues simultaneously, we introduce a novel loss function, termed Align Loss, designed to resolve the discrepancy between the two tasks. Align Loss guides the optimization of DETR through a joint quality metric, strengthening the connection between classification and regression. Furthermore, it incorporates an exponential down-weighting term to facilitate a smooth transition from positive to negative samples. Align-DETR also employs many-to-one matching for supervision of intermediate layers, akin to the design of H-DETR, which enhances robustness against instability. We conducted extensive experiments, yielding highly competitive results. Notably, our method achieves 49.3% AP (+0.6) on the H-DETR baseline with a ResNet-50 backbone. It also sets a new state of the art, reaching 50.5% AP in the 1x setting and 51.7% AP in the 2x setting, surpassing several strong competitors. Our code is available at https://github.com/FelixCaae/AlignDETR.
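To make the abstract's description concrete, the following is a minimal Python sketch of how a joint quality metric, rank-based exponential down-weighting, and a soft-target classification loss could fit together for the queries matched to a single ground-truth box. It is not the authors' implementation. The geometric-mean form of the quality score, the function names alignment_targets and soft_bce, and the values of alpha and tau are all assumptions made for illustration; the released code at the URL above is the authoritative reference.

```python
import math


def alignment_targets(scores, ious, alpha=0.25, tau=1.5):
    """Soft classification targets for the queries matched to ONE ground truth.

    scores: classification confidences in [0, 1] of the matched queries
    ious:   IoU of each matched query's box with the ground-truth box
    alpha, tau: hypothetical hyperparameters (placeholder values)
    """
    # Joint quality: blend of confidence and IoU (assumed functional form).
    quality = [s ** alpha * u ** (1.0 - alpha) for s, u in zip(scores, ious)]
    # Rank the matched queries by quality, best first (rank 0 = best).
    order = sorted(range(len(quality)), key=lambda i: quality[i], reverse=True)
    targets = [0.0] * len(quality)
    for rank, i in enumerate(order):
        # Exponential down-weighting: lower-ranked queries receive smaller
        # targets, moving smoothly from positive toward negative samples.
        targets[i] = math.exp(-rank / tau) * quality[i]
    return targets


def soft_bce(score, target, eps=1e-7):
    """Binary cross-entropy against a soft target in [0, 1]."""
    s = min(max(score, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(s) + (1.0 - target) * math.log(1.0 - s))


# Three queries matched to the same ground-truth box (made-up numbers).
scores = [0.9, 0.4, 0.7]
ious = [0.5, 0.8, 0.75]
targets = alignment_targets(scores, ious)
losses = [soft_bce(s, t) for s, t in zip(scores, targets)]
```

In this sketch the query with the best blend of confidence and IoU keeps its full quality score as its target, while the others are pushed toward zero according to their rank. A highly confident but poorly localized query therefore no longer receives the largest target, which is the classification-regression alignment the abstract describes.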

Paper Structure (14 sections, 7 equations, 2 figures, 5 tables)

Introduction
Related Work
Label Assignment in Object Detection
End-to-end Object Detection
Method
Preliminaries
Motivation and Framework
Align-DETR
Experiments
Setup
Main Results
Comparison with Related Methods
Ablation Study
Conclusion

Figures (2)

Figure 1: Left: Intersection over Union (IoU) distribution of two types of samples. There is a notable gap between the best-regressed samples (oracle) and the high-confidence samples, indicating a discrepancy between the two tasks. Right: Convergence curves of Align-DETR and DINO; Align-DETR converges significantly faster.
Figure 2: Architecture overview of the proposed approach, Align-DETR. Align-DETR adopts many-to-one matching, where each GT is assigned multiple queries. These queries are sorted according to their quality. Then, an alignment score is computed for each query according to its rank, classification confidence, and IoU with the GT. The alignment score is used in the loss computation for both classification and regression.
