Align-DETR: Enhancing End-to-end Object Detection with Aligned Loss
Zhi Cai, Songtao Liu, Guodong Wang, Zheng Ge, Xiangyu Zhang, Di Huang
TL;DR
Align-DETR tackles two core misalignment issues in DETR: (1) misalignment between classification confidence and localization accuracy, and (2) misalignment of training targets across decoder layers. It introduces Align Loss, a regression-aware loss with soft targets, and pairs it with a many-to-one, ranking-based matching strategy for intermediate layers. An exponential down-weighting term lets the targets transition smoothly from positives to negatives, which stabilizes cross-layer supervision. Together, these components tie classification to regression quality and yield clear improvements on COCO (50.5 AP with the 1x schedule and 51.7 AP with the 2x schedule), surpassing several strong DETR variants. The result is a simple and effective enhancement to end-to-end object detection.
Abstract
DETR has set up a simple end-to-end pipeline for object detection by formulating the task as a set prediction problem, showing promising potential. Despite its notable advancements, this paper identifies two key forms of misalignment within the model: classification-regression misalignment and cross-layer target misalignment. Both issues impede DETR's convergence and degrade its overall performance. To tackle both issues simultaneously, we introduce a novel loss function, termed Align Loss, designed to resolve the discrepancy between the two tasks. Align Loss guides the optimization of DETR through a joint quality metric, strengthening the connection between classification and regression. Furthermore, it incorporates an exponential down-weighting term to facilitate a smooth transition from positive to negative samples. Align-DETR also employs many-to-one matching for supervision of intermediate layers, akin to the design of H-DETR, which enhances robustness against instability. We conducted extensive experiments, yielding highly competitive results. Notably, our method achieves 49.3% AP (+0.6) on the H-DETR baseline with a ResNet-50 backbone. It also sets a new state of the art, reaching 50.5% AP in the 1x setting and 51.7% AP in the 2x setting, surpassing several strong competitors. Our code is available at https://github.com/FelixCaae/AlignDETR.
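To make the idea of a joint quality metric with exponential down-weighting concrete, the following is a minimal sketch and not the authors' implementation. It assumes the quality of a candidate is a weighted geometric mean of its classification score and its IoU with the ground-truth box, and that the soft target decays exponentially with the candidate's rank. The function names and the hyperparameters `alpha` and `tau` are hypothetical and chosen only for illustration.

```python
import math


def soft_targets(cls_scores, ious, alpha=0.25, tau=1.5):
    """Soft classification targets for the candidates assigned to one ground-truth box.

    Assumption: quality = score**alpha * iou**(1 - alpha); alpha and tau are
    illustrative values, not taken from the source text.
    """
    # Joint quality couples classification confidence with localization accuracy.
    quality = [s ** alpha * u ** (1.0 - alpha) for s, u in zip(cls_scores, ious)]
    # Rank candidates by quality; rank 0 is the best candidate.
    order = sorted(range(len(quality)), key=lambda i: quality[i], reverse=True)
    targets = [0.0] * len(quality)
    for rank, idx in enumerate(order):
        # Exponential down-weighting: lower-ranked candidates drift toward negatives.
        targets[idx] = math.exp(-rank / tau) * quality[idx]
    return targets


def bce(p, t, eps=1e-7):
    """Binary cross-entropy between a predicted probability p and a soft target t."""
    p = min(max(p, eps), 1.0 - eps)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))


def align_cls_loss(cls_scores, ious, alpha=0.25, tau=1.5):
    """Sum of BCE terms against the rank-weighted soft targets (illustrative only)."""
    targets = soft_targets(cls_scores, ious, alpha, tau)
    return sum(bce(s, t) for s, t in zip(cls_scores, targets))


if __name__ == "__main__":
    scores = [0.9, 0.5, 0.3]  # predicted classification probabilities
    ious = [0.8, 0.6, 0.7]    # IoU of each predicted box with the ground truth
    print(soft_targets(scores, ious))
    print(align_cls_loss(scores, ious))
```

In this sketch the top-ranked candidate keeps its full quality as the target, while each lower rank is scaled by a further factor of `exp(-1 / tau)`, so supervision fades gradually instead of switching abruptly from positive to negative.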
