OD-DETR: Online Distillation for Stabilizing Training of Detection Transformer

Shengjian Wu, Li Sun, Qingli Li

TL;DR

OD-DETR introduces an EMA-based online distillation framework to stabilize DETR training and accelerate convergence. By combining matching distillation, prediction distillation, and auxiliary query groups, it leverages the EMA teacher as a continually improving source of supervision without adding parameters. The approach yields consistent AP gains across Def-DETR, DAB-Def-DETR, and DINO variants on MS-COCO and enhances training stability as measured by query-GT matching consistency. The empirical results reveal a beneficial feedback loop where a stronger EMA teacher improves the online student, and the improved student further benefits the EMA model.
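
The EMA teacher mentioned above can be illustrated with a minimal sketch of an exponential moving average over model weights. The function name `ema_update`, the dictionary-of-floats representation, and the decay value are assumptions made for illustration; the source does not state the decay used by OD-DETR.

```python
def ema_update(teacher, student, decay=0.999):
    """Return new teacher weights as an exponential moving average.

    teacher, student: dicts mapping a parameter name to a float
    (a stand-in for real tensors). The decay value is an assumption.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# The teacher drifts slowly toward the student after every training step.
teacher = {"w": 1.0}
student = {"w": 0.0}
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.9)
print(teacher["w"])
```

Because the teacher is only an average of past student weights, it adds no trainable parameters, which is consistent with the claim in the summary.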

Abstract

DEtection TRansformer (DETR) has become a dominant paradigm, mainly due to its common architecture, which offers high accuracy and requires no post-processing. However, DETR suffers from unstable training dynamics and needs more data and epochs to converge than CNN-based detectors. This paper aims to stabilize DETR training through online distillation. It uses a teacher model, accumulated by Exponential Moving Average (EMA), and distills its knowledge into the online model in the following three ways. First, the matching relation between object queries and ground truth (GT) boxes in the teacher is employed to guide the student, so queries within the student are not only assigned labels based on their own predictions, but also refer to the matching results from the teacher. Second, the teacher's initial queries are given to the online student, and the student's predictions for them are directly constrained by the corresponding outputs from the teacher. Finally, the object queries from the teacher's different decoding stages are used to build auxiliary groups that accelerate convergence. For each GT, the two queries with the lowest matching costs are selected into this extra group; they predict the GT box and participate in the optimization. Extensive experiments show that the proposed OD-DETR successfully stabilizes training and significantly increases performance without introducing more parameters.
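
The auxiliary-group selection described in the abstract (two lowest-cost queries per GT) can be sketched as follows. This is an illustration under assumptions: the cost matrix layout `cost[q][g]`, the function name `select_auxiliary_queries`, and the handling of a query chosen for more than one GT are not specified by the source.

```python
def select_auxiliary_queries(cost, k=2):
    """Pick the k queries with the lowest matching cost for each GT.

    cost[q][g] is the (hypothetical) matching cost between teacher
    query q and ground-truth box g. Returns {gt_index: [query indices]},
    ordered from lowest to highest cost.
    """
    num_queries = len(cost)
    num_gts = len(cost[0]) if cost else 0
    selected = {}
    for g in range(num_gts):
        # Rank all queries by their cost against this GT.
        ranked = sorted(range(num_queries), key=lambda q: cost[q][g])
        selected[g] = ranked[:k]
    return selected


# Three teacher queries, two GT boxes.
cost = [
    [0.9, 0.10],
    [0.2, 0.80],
    [0.3, 0.05],
]
print(select_auxiliary_queries(cost))  # {0: [1, 2], 1: [2, 0]}
```

In this toy example query 2 is selected for both GTs; how such overlaps are resolved is left open here.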

Paper Structure (28 sections, 7 equations, 7 figures, 11 tables)


Figures (7)

  • Figure 1: The matching instability curves for the first 12 training epochs. We compare our OD-DETR and its EMA version with Def-DETR. The metric, introduced in DN-DETR (li2022dn), is calculated on COCO val2017. A lower value means more stable matching. As expected, the EMA model's instability is much lower than the online model's. In OD-DETR, the online student learns from its EMA teacher, which greatly increases its stability. The improved online model in turn helps to stabilize the EMA model's matching results.
  • Figure 2: The overall architecture of OD-DETR. In the Main Group, the EMA model's queries are given to the EMA decoder to create Predictions Teacher, which are matched with the GT set using Hungarian matching. Simultaneously, these queries are also input into the Online Decoder to produce Predictions Student, which learn directly from Predictions Teacher through prediction distillation. The online model's own queries are decoded into Predictions Online, which are also matched with the GT set. Then, through matching distillation, they refer to the matching result of Predictions Teacher. In the Auxiliary Group, we select the two updated queries with the lowest matching cost for each GT from Predictions Teacher. These selections are added as an extra group to the Online Decoder.
  • Figure 3: Matching Distillation injects GTs matched in the EMA teacher into the corresponding queries in the online student. On the left, two independent matchings are first carried out, assigning GTs to queries in the EMA and online models, respectively. On the right, new matches for the online model are shown in gold. For classification, this creates many-to-many matches, where a query such as Q3 can match both GT2 and GT1. For regression, to avoid confusion, each query is assigned only one GT, making a one-to-many match between GTs and queries.
  • Figure 4: Illustration for setting the multi-target label. The prediction shown here matches two objects of different classes, the dog and the bicycle, with IOUs of 0.6 and 0.8, respectively. In its label vector, two entries for dog and bicycle are set accordingly. Other category elements stay at 0.
  • Figure 5: The matching consistency between the EMA and online models. The curve shows the percentage of queries matched with the same GT during training. In both our OD-DETR and OD-DINO models, this ratio is significantly higher than in Def-DETR and DINO.
  • ...and 2 more figures
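
The multi-target label described in the Figure 4 caption can be sketched as below. This is a hedged illustration: the caption only says the dog and bicycle entries are "set accordingly", so writing the IoU itself into each entry is an assumption, as are the function name `multi_target_label` and the class indices.

```python
def multi_target_label(num_classes, matches):
    """Build a soft classification label for one prediction.

    matches: list of (class_id, iou) pairs for the objects this
    prediction is matched with. Each matched class entry is set to
    the IoU (an assumption); all other entries stay at 0.
    """
    label = [0.0] * num_classes
    for class_id, iou in matches:
        # Keep the larger value if two matched objects share a class.
        label[class_id] = max(label[class_id], iou)
    return label


# Hypothetical class indices: 1 = dog, 3 = bicycle (IoUs from Figure 4).
print(multi_target_label(4, [(1, 0.6), (3, 0.8)]))  # [0.0, 0.6, 0.0, 0.8]
```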