Why does Knowledge Distillation Work? Rethink its Attention and Fidelity Mechanism

Chenqi Guo; Shiwei Zhong; Xiaofeng Liu; Qianli Feng; Yinglong Ma

TL;DR

This work addresses why Knowledge Distillation (KD) works by challenging the assumption that high fidelity between student and teachers is always beneficial. It introduces the idea that diversity in teacher attentions, promoted by stronger data augmentation, provides a broader supervisory perspective that improves generalization even as fidelity and mutual information decline. By using IoU of attention maps and an Affinity augmentation-strength metric, the paper demonstrates a causal link between attention diversity and improved student performance, and shows that logits-matching optimization mainly aids generalization at the cost of fidelity. The findings suggest practical KD design principles that leverage heterogeneous teacher signals and moderated mimicry to achieve robust generalization in vision tasks, including ensembles and transformers.

Abstract

Does Knowledge Distillation (KD) really work? Conventional wisdom views it as a knowledge transfer procedure in which perfect mimicry of the teacher by the student is desired. However, paradoxical studies indicate that closely replicating the teacher's behavior does not consistently improve student generalization, raising questions about the possible causes. Confronted with this gap, we hypothesize that diverse attentions in teachers contribute to better student generalization at the expense of reduced fidelity in ensemble KD setups. Our key findings reveal that increasing data augmentation strength decreases the Intersection over Union (IoU) of attentions between teacher models, leading to reduced student overfitting and decreased fidelity. We propose this low-fidelity phenomenon as an underlying characteristic rather than a pathology when training KD. This suggests that stronger data augmentation fosters a broader perspective provided by the divergent teacher ensemble and lower student-teacher mutual information, benefiting generalization performance. These insights clarify the mechanism behind the low-fidelity phenomenon in KD. Thus, we offer new perspectives on optimizing student model performance by emphasizing increased diversity in teacher attentions and reduced mimicry behavior between teachers and student.
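
To make the IoU measurement concrete, here is a minimal sketch of how the overlap between two teachers' attention maps could be computed. The binarization step (min-max normalization followed by a fixed threshold of 0.5) and the names `binarize` and `attention_iou` are assumptions for illustration; this summary does not state how the paper extracts or thresholds its attention maps.

```python
from typing import List

AttentionMap = List[List[float]]


def binarize(att: AttentionMap, threshold: float = 0.5) -> List[List[bool]]:
    """Mark pixels whose min-max normalized attention is >= threshold.

    The threshold value is an assumption, not taken from the paper.
    """
    flat = [v for row in att for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map has no salient region under this convention.
        return [[False for _ in row] for row in att]
    return [[(v - lo) / span >= threshold for v in row] for row in att]


def attention_iou(a: AttentionMap, b: AttentionMap, threshold: float = 0.5) -> float:
    """Intersection over Union of the salient regions of two attention maps."""
    mask_a = binarize(a, threshold)
    mask_b = binarize(b, threshold)
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for pa, pb in zip(row_a, row_b):
            inter += int(pa and pb)
            union += int(pa or pb)
    # Two empty masks share nothing; report 0.0 rather than divide by zero.
    return inter / union if union else 0.0


# Toy 2x2 maps: teacher 1 attends to the top row, teacher 2 to the left column.
t1 = [[0.9, 0.8], [0.1, 0.0]]
t2 = [[0.9, 0.1], [0.8, 0.0]]
iou = attention_iou(t1, t2)  # one shared pixel out of three salient ones
```

Under this convention, a lower IoU means the two teachers attend to more distinct image regions, which is the sense of "attention diversity" used in the abstract.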

Paper Structure (21 sections, 13 equations, 9 figures, 6 tables)

Introduction
Related Works
Problem and Hypothesis
Evaluation Metrics
IoU in Attention Maps
Model Dependency in KD
Quantify Data Augmentation Strength within Ensemble KD
Experimental Setup
Results and Analysis
Impact on Attention Map Diversity
Revisiting the Role of Fidelity and Mutual Information
Effects of Logits Matching Optimization on KD
Conclusion
Detailed Experimental Settings
Fidelity with KL divergence Measurement
...and 6 more sections

Figures (9)

Figure 1: Left: Attention map visualizations for teacher ensembles and student model in Knowledge Distillation (KD) on the ImageNet dataset. Stronger data augmentation ($\text{T}_{1w}\text{T}_{2s}\text{S}_s$ and $\text{T}_{1s}\text{T}_{2w}\text{S}_s$ in this case), as measured by Affinity, increases the teachers' attentional divergence, giving the student a more comprehensive perspective on the overall characteristics of the target images and leading to better generalization ability. Middle and Right: Scatter plots of Intersection over Union (IoU) in attention maps, and of fidelity between teacher ensembles and student during KD training. The decreasing tendency in fidelity challenges the conventional wisdom that higher fidelity consistently correlates with better student performance. Later we will demonstrate that the low-fidelity observation is caused by attention map diversification within teacher ensembles, and that even optimization towards logits matching can hardly mitigate this low-fidelity effect.
Figure 2: Top: Scatter plots of IoU between $\text{T}_1$ and $\text{T}_2$ attention maps during KD training. Bottom: Example attention maps of $\text{T}_1$, $\text{T}_2$ and S. This attention divergence among teacher ensembles, attributed to the randomness injected by data augmentation, gives the student distilled on them a more comprehensive perspective.
Figure 3: Scatter plots of Top: Fidelity (measured by top-1 A) and Bottom: Mutual Information (MI) between teacher ensembles and student during KD training. These decreasing tendencies, together with the improved student validation accuracy, contrast with the traditional viewpoint that higher fidelity consistently benefits student performance, indicating that some degree of student independence may be desirable during KD training.
Figure 4: Bar plots comparing vanilla KD and z-score standardization KD. Top: Generalization performance in terms of train-validation accuracy gap. Bottom: Student-teacher fidelity. The z-score standardization, aimed at facilitating the student-teacher logits matching procedure, does improve student generalization performance (indicated by a lower accuracy gap) in most cases. However, it also leads to a decrease in student-teacher fidelity during training, suggesting that the benefit lies more in student generalization than in fidelity improvement.
Figure B.5: Scatter plots of fidelity (measured by KL divergence) between teacher ensembles and student during KD training. For KL divergence, a higher value implies lower fidelity. Thus, these increasing tendencies align with the decreasing ones with Top-1 A in the main text.
...and 4 more figures
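
The fidelity measures named in the Figure 3 and Figure B.5 captions, and the z-score standardization compared in Figure 4, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes "top-1" fidelity means the fraction of samples on which teacher and student predict the same class, that KL divergence is taken as KL(teacher || student), that the ensemble's logits have already been combined into one vector per sample, and that z-score standardization is applied per sample across the class dimension. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one sample's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def top1_agreement(teacher: Sequence[Sequence[float]],
                   student: Sequence[Sequence[float]]) -> float:
    """Fraction of samples where teacher and student pick the same class.

    Higher values mean higher fidelity (assumed reading of top-1 fidelity).
    """
    matches = sum(argmax(t) == argmax(s) for t, s in zip(teacher, student))
    return matches / len(teacher)


def mean_kl(teacher: Sequence[Sequence[float]],
            student: Sequence[Sequence[float]]) -> float:
    """Mean KL(teacher || student) over samples; higher means lower fidelity.

    The direction of the divergence is an assumption.
    """
    total = 0.0
    for t, s in zip(teacher, student):
        p, q = softmax(t), softmax(s)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher)


def zscore(logits: Sequence[float]) -> List[float]:
    """Standardize one sample's logits to zero mean and unit variance."""
    mean = sum(logits) / len(logits)
    var = sum((z - mean) ** 2 for z in logits) / len(logits)
    std = math.sqrt(var)
    if std == 0:
        return [0.0] * len(logits)
    return [(z - mean) / std for z in logits]


# Toy logits for two samples and three classes.
teacher_logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
student_logits = [[1.5, 1.0, 0.2], [2.0, 1.0, 0.0]]

agreement = top1_agreement(teacher_logits, student_logits)  # agrees on 1 of 2 samples
divergence = mean_kl(teacher_logits, student_logits)

# Standardize both sides before matching, as in the z-score variant.
divergence_z = mean_kl([zscore(t) for t in teacher_logits],
                       [zscore(s) for s in student_logits])
```

The two fidelity measures move in opposite directions: agreement falls and KL divergence rises as the student departs from the teacher, which is why the Figure B.5 caption reads increasing KL as matching the decreasing top-1 trend.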
