CrossKD: Cross-Head Knowledge Distillation for Object Detection

Jiabao Wang, Yuming Chen, Zhaohui Zheng, Xiang Li, Ming-Ming Cheng, Qibin Hou

TL;DR

CrossKD addresses the target conflict in knowledge distillation (KD) for object detection. It feeds the intermediate features of the student's detection head into the teacher's detection head to produce cross-head predictions, which are then trained to mimic the teacher's predictions. Because the distillation loss acts on the cross-head predictions rather than on the student's own predictions, the student's head no longer receives contradictory supervision from the annotations and the teacher, and the supervision stays task-oriented. On MS COCO, CrossKD raises the AP of GFL-ResNet-50 with a 1x schedule from 40.2 to 43.7, outperforming existing KD methods, and it also brings improvements when combined with PKD or applied to heterogeneous backbones. The method generalizes across detector architectures (RetinaNet, FCOS, ATSS) and backbones and extends to Faster R-CNN and Deformable DETR, which suggests broad applicability for compressing object detectors.

Abstract

Knowledge Distillation (KD) has been validated as an effective model compression technique for learning compact object detectors. Existing state-of-the-art KD methods for object detection are mostly based on feature imitation. In this paper, we present a general and effective prediction mimicking distillation scheme, called CrossKD, which delivers the intermediate features of the student's detection head to the teacher's detection head. The resulting cross-head predictions are then forced to mimic the teacher's predictions. This manner relieves the student's head from receiving contradictory supervision signals from the annotations and the teacher's predictions, greatly improving the student's detection performance. Moreover, as mimicking the teacher's predictions is the target of KD, CrossKD offers more task-oriented information in contrast with feature imitation. On MS COCO, with only prediction mimicking losses applied, our CrossKD boosts the average precision of GFL ResNet-50 with 1x training schedule from 40.2 to 43.7, outperforming all existing KD methods. In addition, our method also works well when distilling detectors with heterogeneous backbones. Code is available at https://github.com/jbwang1997/CrossKD.
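
The cross-head idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes each detection head is a plain list of layers, uses a toy affine-plus-ReLU layer, and uses mean squared error as a placeholder for both the detection loss and the distillation loss. The names `make_layer`, `run_layers`, `cross_head_predict`, `mse`, and `split` are hypothetical and chosen only for illustration. Plain Python does not model gradients, so the frozen teacher layers are noted only in comments.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float, bias: float) -> Layer:
    """Toy stand-in for one head layer: elementwise affine map followed by ReLU."""
    def layer(x: Vector) -> Vector:
        return [max(0.0, scale * v + bias) for v in x]
    return layer


def run_layers(layers: Sequence[Layer], x: Vector) -> Vector:
    """Apply the layers in order."""
    for layer in layers:
        x = layer(x)
    return x


def cross_head_predict(student_head: Sequence[Layer],
                       teacher_head: Sequence[Layer],
                       student_feat: Vector,
                       split: int) -> Vector:
    """Run the first `split` student layers, then the remaining teacher layers.

    In a real setup the teacher layers would be frozen, so the distillation
    gradient would only update the student layers before `split`.
    """
    intermediate = run_layers(student_head[:split], student_feat)
    return run_layers(teacher_head[split:], intermediate)


def mse(a: Vector, b: Vector) -> float:
    """Placeholder loss (assumption): mean squared error."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy three-layer heads; the numbers are arbitrary.
student_head = [make_layer(0.5, 0.0), make_layer(1.0, 0.25), make_layer(2.0, 0.0)]
teacher_head = [make_layer(1.0, 0.0), make_layer(1.0, 0.0), make_layer(3.0, 0.0)]

student_feat = [1.0, 2.0]   # features entering the student head
teacher_feat = [1.0, 2.0]   # features entering the teacher head
gt_target = [2.0, 2.0]      # made-up ground-truth target

student_pred = run_layers(student_head, student_feat)
teacher_pred = run_layers(teacher_head, teacher_feat)
cross_pred = cross_head_predict(student_head, teacher_head, student_feat, split=1)

# The detection loss supervises the student's own predictions with annotations;
# the distillation loss compares cross-head predictions with the teacher's.
det_loss = mse(student_pred, gt_target)
kd_loss = mse(cross_pred, teacher_pred)
total_loss = det_loss + kd_loss
```

In this sketch the two losses are attached to different outputs: `det_loss` sees only `student_pred`, while `kd_loss` sees only `cross_pred`. That separation is what the abstract describes as relieving the student's head from contradictory supervision signals.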

Paper Structure

This paper contains 21 sections, 6 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Comparisons between conventional KD methods and our CrossKD. Rather than explicitly enforcing the consistency between the intermediate feature maps or the predictions of the teacher-student pair, CrossKD implicitly builds the connection between the heads of the teacher-student pair to improve the distillation efficiency.
  • Figure 2: Visualizations of the classification predictions from GFL. (a) and (b) are the ground truth and the distillation targets. (c) and (d) are the classification outputs predicted by models trained with conventional prediction mimicking and the proposed CrossKD. In the green circled areas, the distillation targets predicted by the teacher have a large discrepancy with the ground-truth targets assigned to the student. Prediction mimicking forces the student to mimic the teacher, while CrossKD can smooth the mimicking process.
  • Figure 3: Statistics of the target conflict degree between student (GFL-R50) and teacher (GFL-R101, ATSS-R101, RetinaNet-R101). X-axis is the teacher-student discrepancy threshold for conflict areas. Y-axis represents the ratios of the target conflict areas to the positive areas.
  • Figure 4: Overall framework of the proposed CrossKD. For a given teacher-student pair, CrossKD first delivers the intermediate features of the student into the teacher layers and generates the cross-head predictions $\hat{\bm{p}}^s$. Then, distillation losses are calculated between the original teacher's predictions and the cross-head predictions of the student. In back-propagation, the gradients with respect to the detection loss normally pass through the student detection head, while the distillation gradients propagate through the frozen teacher layers.
  • Figure 5: Visualizations of the gradients w.r.t. feature imitation and CrossKD. The visualization demonstrates that our CrossKD, guided by prediction mimicking, can effectively focus on the potentially valuable regions.
  • ...and 3 more figures