DFMSD: Dual Feature Masking Stage-wise Knowledge Distillation for Object Detection

Zhourui Zhang, Jun Li, Zhijian Wu, Jifeng Shen, Jianhua Xu

TL;DR

DFMSD is a novel dual feature-masking heterogeneous distillation framework for object detection. Experiments suggest that it outperforms state-of-the-art heterogeneous and homogeneous distillation methods.

Abstract

Mainstream feature-masking distillation methods work by reconstructing selectively masked regions of a student network from the feature maps of a teacher network. In these methods, attention mechanisms help identify spatially important regions and crucial object-aware channel cues, so that the reconstructed features carry discriminative and representational power similar to that of the teacher features. However, previous feature-masking distillation methods mainly address homogeneous knowledge distillation and do not fully account for the heterogeneous scenario. In particular, the large discrepancy between the teacher and student frameworks in heterogeneous distillation is detrimental to feature masking and degrades the reconstructed student features. In this study, a novel dual feature-masking heterogeneous distillation framework termed DFMSD is proposed for object detection. More specifically, a stage-wise adaptation learning module is incorporated into the dual feature-masking framework, so that the student model is progressively adapted to the teacher models, bridging the gap between heterogeneous networks. Furthermore, a masking enhancement strategy is combined with stage-wise learning such that object-aware masking regions are adaptively strengthened to improve feature-masking reconstruction. In addition, semantic alignment is performed at each Feature Pyramid Network (FPN) layer between the teacher and the student networks to generate consistent feature distributions. Our experiments on object detection demonstrate the promise of the approach, suggesting that DFMSD outperforms state-of-the-art heterogeneous and homogeneous distillation methods.
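The masking-and-reconstruction idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the attention definition (channel-averaged absolute activation followed by a temperature softmax), the masking ratio, the choice to mask the highest-attention positions, and the `generator` placeholder are all assumptions made for illustration, and only the spatial half of the dual (spatial and channel) masking is shown.

```python
import numpy as np

def spatial_attention(feat, temperature=0.5):
    """Assumed attention: softmax over positions of the channel-averaged |activation|."""
    _, h, w = feat.shape
    energy = np.abs(feat).mean(axis=0).reshape(-1) / temperature
    energy = energy - energy.max()          # numerical stability
    weights = np.exp(energy)
    weights = weights / weights.sum()
    return (h * w * weights).reshape(h, w)  # scaled so the map averages to 1

def top_ratio_mask(attention, ratio=0.5):
    """Return 1 at the highest-attention positions (to be masked), 0 elsewhere."""
    k = int(round(ratio * attention.size))
    mask = np.zeros(attention.size)
    if k > 0:
        mask[np.argsort(attention.reshape(-1))[-k:]] = 1.0
    return mask.reshape(attention.shape)

def masked_reconstruction_loss(student, teacher, generator, ratio=0.5):
    """Mask student features where the teacher attends, reconstruct, compare to teacher."""
    mask = top_ratio_mask(spatial_attention(teacher), ratio)
    masked_student = student * (1.0 - mask)      # mask broadcasts over channels
    reconstructed = generator(masked_student)    # hypothetical generation block
    return float(np.mean((reconstructed - teacher) ** 2))

# Usage with random (C, H, W) features and an identity stand-in for the generator.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 4, 4))
student = rng.normal(size=(8, 4, 4))
loss = masked_reconstruction_loss(student, teacher, generator=lambda x: x)
```

In a real detector the generator would be a small learned network and the loss would be summed over FPN levels; the identity function here only keeps the sketch self-contained.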

Paper Structure (33 sections, 8 equations, 10 figures, 11 tables)

Figures (10)

  • Figure 1: Comparison of activation maps of heterogeneous detectors evaluated on the image "hummingbird". For our proposed DFMSD method, we use the one-stage detector RetinaNet as the student and the two-stage detector Faster R-CNN as the teacher. The qualitative results show significant variation in the object-aware perception capability of different detectors, as characterized by the map intensity. More importantly, our method highlights more object-aware regions with higher intensity, and the resulting feature map contains the most discriminative information.
  • Figure 2: Our proposed DFMSD distillation framework. Following the dual-masked knowledge distillation (DMKD) framework, in which both spatially salient regions and informative channels are identified, a stage-wise adaptive learning (SAL) strategy is integrated, allowing the student network to progressively learn from different heterogeneous teacher networks with improved adaptability. Simultaneously, a masking enhancement module is incorporated into SAL so that the object-aware masking regions are enhanced to improve masked feature reconstruction. In addition, semantic feature alignment is performed at each FPN layer between teacher and student backbones, producing consistent feature distributions that further bridge the teacher-student gap.
  • Figure 3: Illustration of our SAL mechanism for adaptively improving the distillation performance. With Swin Transformer [24] and Faster R-CNN [10] used as the teacher and student networks, respectively, the two-stage SAL mechanism first improves the Faster R-CNN detector from 38.4% to 42.2% with a Swin-Transformer-T model [24], and then further boosts the student performance to 42.9% with a more powerful Swin-Transformer-S detector [24]. In contrast, the conventional one-stage distillation approach improves the Faster R-CNN model only to 42.3% accuracy, which is roughly the first-stage distillation performance within SAL.
  • Figure 4: Comparison of the student feature maps at different distillation stages of our SAL mechanism. (a) and (b) show the feature maps obtained by the original teacher and student models, respectively. (c) and (d) show the student feature maps generated after the first-stage and second-stage distillation, respectively. Distinct object-aware regions are captured after consecutive distillation stages, yielding sufficiently discriminative feature maps close to their teacher counterparts.
  • Figure 5: Comparison of object-aware candidate boxes and region-aware attention score distributions in different frequency domains, obtained with the RetinaNet detector. RetinaNet clearly misses some small objects, including the football and the partially occluded referee in black at the far end, even in the high-frequency domain of the image. This is also visible in the region-specific attention maps, where the corresponding regions receive low scores. However, with the help of our adaptive augmentation strategy, the importance of the high-frequency regions corresponding to the small objects is raised, with the increased attention scores highlighted in (f), which benefits the subsequent feature masking and reconstruction.
  • ...and 5 more figures
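
The scores quoted in the Figure 3 caption can be tallied to make the per-stage gains explicit. The short Python sketch below uses only the four numbers stated in that caption; the variable and function names are illustrative and do not come from the paper.

```python
# Scores (in %) as reported in the Figure 3 caption.
BASELINE = 38.4        # Faster R-CNN student before distillation
AFTER_STAGE_1 = 42.2   # after distilling from Swin-Transformer-T
AFTER_STAGE_2 = 42.9   # after further distilling from Swin-Transformer-S
SINGLE_STEP = 42.3     # conventional one-stage distillation

def gain(after, before):
    """Improvement in percentage points, rounded to one decimal place."""
    return round(after - before, 1)

stage_1_gain = gain(AFTER_STAGE_1, BASELINE)           # first SAL stage
stage_2_gain = gain(AFTER_STAGE_2, AFTER_STAGE_1)      # second SAL stage
total_sal_gain = gain(AFTER_STAGE_2, BASELINE)         # both stages combined
single_step_gain = gain(SINGLE_STEP, BASELINE)         # one-stage baseline
sal_advantage = gain(AFTER_STAGE_2, SINGLE_STEP)       # SAL over one-stage

print(stage_1_gain, stage_2_gain, total_sal_gain, single_step_gain, sal_advantage)
```

Read this way, most of the improvement comes from the first stage, and the second stage accounts for the margin over one-stage distillation.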