Knowledge Distillation via the Target-aware Transformer

Sihao Lin; Hongwei Xie; Bing Wang; Kaicheng Yu; Xiaojun Chang; Xiaodan Liang; Gang Wang

TL;DR

The paper tackles semantic mismatch in knowledge distillation when transferring from large teachers to smaller students by proposing a target-aware transformer (TaT) that enables one-to-all spatial distillation, reconfiguring the student features conditioned on teacher components. It introduces a hierarchical distillation framework with patch-group and anchor-point distillation to manage computational complexity and capture local as well as global dependencies. Empirical results on ImageNet, Pascal VOC, and COCOStuff10k show significant gains over state-of-the-art KD methods, including substantial improvements for compact architectures and segmentation tasks. The approach offers a practical path to stronger, more transferable representations in resource-constrained models, with potential extensions to multi-layer distillation and other vision tasks in future work.

Abstract

Knowledge distillation has become a de facto standard for improving the performance of small neural networks. Most previous works regress the representational features from the teacher to the student in a one-to-one spatial matching fashion. However, it is often overlooked that, due to architectural differences, the semantic information at the same spatial location usually varies between the two networks. This greatly undermines the underlying assumption of the one-to-one distillation approach. To this end, we propose a novel one-to-all spatial matching knowledge distillation approach. Specifically, we allow each pixel of the teacher feature to be distilled to all spatial locations of the student features according to its similarity, which is generated by a target-aware transformer. Our approach surpasses the state-of-the-art methods by a significant margin on various computer vision benchmarks, such as ImageNet, Pascal VOC and COCOStuff10k. Code is available at https://github.com/sihaoevery/TaT.
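The contrast between one-to-one and one-to-all matching can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes the similarity is a plain dot product followed by a softmax over student positions, that teacher and student features share the same channel width, and it leaves out any learned projections a real target-aware transformer would use. The function names (`one_to_one_loss`, `one_to_all_loss`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def one_to_one_loss(f_s, f_t):
    # Baseline: regress each student position onto the teacher
    # feature at the same spatial position.
    return float(np.mean((f_s - f_t) ** 2))

def one_to_all_loss(f_s, f_t):
    """Illustrative one-to-all matching (an assumption, not the paper's code).

    f_s, f_t: arrays of shape (N, C), with N = H * W flattened spatial
    positions and C channels (assumed equal for teacher and student).
    """
    # corr[i, j]: weight of student position j when matching teacher
    # position i. Each row sums to 1.
    corr = softmax(f_t @ f_s.T, axis=1)            # (N, N)
    # Reconfigure the student: every output position is a weighted
    # sum over all student positions.
    f_s_reconfigured = corr @ f_s                  # (N, C)
    loss = float(np.mean((f_s_reconfigured - f_t) ** 2))
    return loss, corr, f_s_reconfigured

# Toy case: the student holds the teacher's content at shifted positions.
f_t = 10.0 * np.eye(4)             # 4 positions, 4 channels
f_s = f_t[[1, 2, 3, 0]]            # same rows, spatially permuted
baseline = one_to_one_loss(f_s, f_t)
loss, corr, f_s_re = one_to_all_loss(f_s, f_t)
```

In this toy case the one-to-one loss is large because every position is compared with the wrong content, while the one-to-all map can route each teacher position to the student position that actually carries the matching content.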

Paper Structure (15 sections, 10 equations, 4 figures, 14 tables)

This paper contains 15 sections, 10 equations, 4 figures, 14 tables.

Introduction
Related Works
Method
Formulation
Hierarchical Distillation
Patch-group Distillation
Anchor-point Distillation
Experiment
Datasets
Implementation Details
Image Classification
Semantic Segmentation
Ablation Study
Conclusion
Discussion

Figures (4)

Figure 1: Illustration of semantic mismatch. Suppose that the teacher and student are 3-layer and 2-layer convnets with kernel size $3\times 3$ and stride $1\times 1$. (a) The receptive field of the middle pixel of the final feature map, where the blue box is the teacher's receptive field and the orange box is the student's. Since the teacher model has more convolutional layers, the resulting teacher feature map has a larger receptive field and thus contains richer semantic information. (b) Hence, directly regressing the student's and teacher's features in a one-to-one spatial matching fashion may be suboptimal. (c) We propose one-to-all knowledge distillation via a target-aware transformer, which lets the teacher's spatial components be distilled to the entire student feature map.
Figure 2: Illustration of our framework. (a) Target-aware Transformer. Conditioned on the teacher feature and the student feature, the transformation map Corr. is computed and then applied to the student feature to reconfigure it; the reconfigured feature is then trained to minimize the L$_2$ loss with the corresponding teacher feature. (b) Patch-group Distillation. Both teacher and student features are sliced and rearranged into groups for distillation. By concatenating the patches within a group, we explicitly introduce the spatial correlation among the patches beyond the patches themselves. (c) Anchor-point Distillation. Each color indicates a region. We use average pooling to extract the anchor within a local area of the given feature map, forming a new, smaller feature map. The generated anchor-point features then participate in the distillation.
Figure 3: The performance of our model under different $\epsilon$ on ImageNet. Here the loss $\mathcal{L}_{\rm{KL}}$ is removed and $\alpha$ is set to 0.1.
Figure 4: Visualization of the feature map and TaT map. The input is selected from the ImageNet validation set. The teacher backbone is ResNet34 and the student backbone is ResNet18. The feature map of the distillation layer (4th block) is visualized. Although there are 512 feature channels in total, we show 64 channels for clarity. Through the target-aware transformer, the reconfigured student feature (3rd column) exhibits a pattern similar to that of the teacher feature (4th column). The associated TaT map is also visualized; it indicates that the student aggregates semantics mostly from neighboring pixels to enhance each of its pixels.
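The slicing, grouping, and pooling steps described in the Figure 2 caption, panels (b) and (c), can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names (`split_patches`, `group_patches`, `anchor_points`) are hypothetical, patches are grouped in row-major order, patches within a group are concatenated along the channel axis (the caption does not specify the axis), and the average pooling uses non-overlapping square windows.

```python
import numpy as np

def split_patches(feat, n, m):
    """Slice an (H, W, C) feature map into n * m patches, row-major."""
    H, W, C = feat.shape
    h, w = H // n, W // m
    assert H == n * h and W == m * w, "H and W must divide evenly"
    patches = feat.reshape(n, h, m, w, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(n * m, h, w, C)         # (n*m, h, w, C)

def group_patches(patches, group_size):
    """Concatenate consecutive patches channel-wise (assumed axis)."""
    P, h, w, C = patches.shape
    assert P % group_size == 0, "patch count must divide into groups"
    g = patches.reshape(P // group_size, group_size, h, w, C)
    g = g.transpose(0, 2, 3, 1, 4)                 # (G, h, w, group_size, C)
    return g.reshape(P // group_size, h, w, group_size * C)

def anchor_points(feat, k):
    """Average-pool an (H, W, C) map with non-overlapping k x k windows."""
    H, W, C = feat.shape
    assert H % k == 0 and W % k == 0, "H and W must be multiples of k"
    return feat.reshape(H // k, k, W // k, k, C).mean(axis=(1, 3))

# Toy 8 x 8 feature map with 2 channels.
feat = np.arange(8 * 8 * 2, dtype=float).reshape(8, 8, 2)
patches = split_patches(feat, 4, 4)    # 16 patches of size 2 x 2
grouped = group_patches(patches, 4)    # 4 groups, 8 channels each
anchors = anchor_points(feat, 4)       # 2 x 2 anchor-point map
```

In this toy setting a dense position-to-position map over the full 8 x 8 feature would have 64 x 64 entries, whereas the 2 x 2 anchor-point map needs only 4 x 4, which is one way to read the claim that the hierarchical scheme keeps the computation manageable.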
