Good Teachers Explain: Explanation-Enhanced Knowledge Distillation

Amin Parchami-Araghi, Moritz Böhle, Sukrut Rao, Bernt Schiele

TL;DR

This work addresses the problem that standard KD may not faithfully transfer the teacher's reasoning. It introduces explanation-enhanced KD (e$^2$KD), which, in addition to matching logits, adds a loss term $L_{exp}$ that enforces similarity between teacher and student explanations. The resulting model-agnostic objective is $L = L_{KD} + \lambda L_{exp}$, with $L_{KD} = \tau^2 D_{KL}(p_T(x;\tau) \,\|\, p_S(x;\tau))$ and $L_{exp} = 1 - \mathrm{sim}(E(T, x, \hat{y}_T), E(S, x, \hat{y}_T))$, where $\tau$ is the distillation temperature, $\lambda$ weights the explanation term, and $E(M, x, \hat{y}_T)$ is the explanation of model $M$ for input $x$ with respect to the teacher's prediction $\hat{y}_T$. The approach yields consistent improvements in accuracy and teacher-student agreement, promotes being "right for the right reasons", and preserves or transfers interpretability across architectures and tasks, including ImageNet, Waterbirds, and VOC, even under limited data and with approximate, pre-computed ("frozen") explanations. Overall, e$^2$KD provides a simple, effective, and robust enhancement to KD, with practical impact for faithful model distillation and interpretability-focused applications.
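The objective above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes that `sim` is cosine similarity, that explanations are already computed and flattened into lists of floats, and that all function names (`kd_loss`, `explanation_loss`, `e2kd_loss`) are hypothetical.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, tau=1.0):
    """L_KD = tau^2 * KL(p_T(x; tau) || p_S(x; tau))."""
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    return tau ** 2 * kl


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity of two flat vectors (assumed choice for `sim`)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def explanation_loss(teacher_expl, student_expl):
    """L_exp = 1 - sim(E(T, x, y_T), E(S, x, y_T)).

    Both explanations are assumed to be computed for the teacher's
    predicted class y_T and flattened to lists of floats.
    """
    return 1.0 - cosine_similarity(teacher_expl, student_expl)


def e2kd_loss(teacher_logits, student_logits,
              teacher_expl, student_expl, tau=1.0, lam=1.0):
    """L = L_KD + lambda * L_exp for a single input x."""
    return (kd_loss(teacher_logits, student_logits, tau)
            + lam * explanation_loss(teacher_expl, student_expl))
```

In this sketch, a student that reproduces both the teacher's logits and its explanation incurs a loss of (numerically) zero, while a student whose explanation is orthogonal to the teacher's pays the full penalty of $\lambda$ from the explanation term.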

Abstract

Knowledge Distillation (KD) has proven effective for compressing large teacher models into smaller student models. While it is well known that student models can achieve accuracies similar to those of their teachers, it has also been shown that they nonetheless often do not learn the same function. It is, however, often highly desirable that the student's and teacher's functions share similar properties, such as basing the prediction on the same input features, as this ensures that students learn the 'right features' from the teachers. In this work, we explore whether this can be achieved by not only optimizing the classic KD loss but also the similarity of the explanations generated by the teacher and the student. Despite the idea being simple and intuitive, we find that our proposed 'explanation-enhanced' KD (e$^2$KD) (1) consistently provides large gains in terms of accuracy and student-teacher agreement, (2) ensures that the student learns from the teacher to be right for the right reasons and to give similar explanations, and (3) is robust with respect to the model architectures and the amount of training data, and even works with 'approximate', pre-computed explanations.
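To make the distinction between accuracy and student-teacher agreement concrete, the following sketch computes both on a toy batch. It assumes that agreement means the fraction of inputs on which teacher and student share the same top-1 prediction; the function names and the toy logits are hypothetical and not taken from the paper.

```python
def argmax(scores):
    """Index of the largest score (first one on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(logits_batch, labels):
    """Fraction of samples whose top-1 prediction equals the label."""
    correct = sum(argmax(z) == y for z, y in zip(logits_batch, labels))
    return correct / len(labels)


def agreement(teacher_logits_batch, student_logits_batch):
    """Fraction of samples on which teacher and student top-1 predictions match."""
    matches = sum(argmax(t) == argmax(s)
                  for t, s in zip(teacher_logits_batch, student_logits_batch))
    return matches / len(teacher_logits_batch)


# Toy two-class batch: both models are right on 3 of 4 samples,
# but they make their mistakes on different inputs.
labels = [0, 1, 1, 1]
teacher = [[2.0, 1.0], [0.0, 3.0], [1.0, 0.0], [0.0, 1.0]]  # predicts 0, 1, 0, 1
student = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [3.0, 1.0]]  # predicts 0, 1, 1, 0

print(accuracy(teacher, labels))    # 0.75
print(accuracy(student, labels))    # 0.75
print(agreement(teacher, student))  # 0.5
```

Equal accuracy with only 50% agreement is the situation the abstract describes: the student matches the teacher's performance without having learned the same function.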

Paper Structure (31 sections, 5 equations, 10 figures, 7 tables)


Figures (10)

  • Figure 1: KD for standard models on Waterbirds-100. We show the accuracy and agreement on in-distribution (top) and out-of-distribution (bottom) test samples when distilling from a ResNet-50 teacher to a ResNet-18 student with various KD approaches. Following consistency, we also evaluate the effectiveness of adding mixup (col. 2) and, additionally, long teaching (col. 3). We find that our proposed e$^2$KD provides significant benefits over vanilla KD, and is further enhanced under long teaching and mixup. We show the performance of prior work for reference, and find that e$^2$KD performs competitively. For results on B-cos models, see \ref{supp:sec:quantitative:waterbirds} and \ref{fig:bcos-waterbirds}.
  • Figure 2: Comparing explanations for KD on Waterbirds. Here we visualize B-cos explanations when distilling a B-cos ResNet-50 teacher (col. 2) to a B-cos ResNet-18 student with KD (col. 3) and e$^2$KD (col. 4). While for in-distribution data (left) the different focus of the models (foreground/background) does not affect the models' predictions (correct predictions marked by ✓), it results in wrong predictions under distribution shift (right, incorrect predictions marked by ✗). For additional qualitative results, including standard models with GradCAM explanations, see \ref{supp:sec:qualitative:waterbirds}.
  • Figure 3: Maintaining focused explanations. We visualize B-cos explanations when distilling a B-cos ResNet-50 teacher that has been trained to not focus on confounding input features (col. 2) to a B-cos ResNet-18 student with KD (col. 3) and e$^2$KD (col. 4). Explanations of e$^2$KD students are significantly closer to the teacher's (and hence more human-aligned). Samples are drawn from the VOC test set, with all models correctly classifying the shown samples. For more qualitative results, see \ref{supp:sec:qualitative:voc}.
  • Figure 4: Distilling inductive biases (CNN$\rightarrow$ViT). We distill a B-cos DenseNet-169 teacher to a B-cos ViT$_\text{Tiny}$. Top-Left: e$^2$KD yields significant gains in accuracy and agreement. Bottom-Left: Cosine similarity of explanations for shifted images w.r.t. the unshifted image ($T$=0). With e$^2$KD (blue) the ViT student learns to mimic the shift periodicity of the teacher (purple), despite the inherent periodicity of 16 of the ViT architecture (seen for vanilla KD, yellow). Notably, e$^2$KD with frozen explanations yields shift-equivariant students (red), see also \ref{sec:result:priors} 'Distill to ViT'. Right: e$^2$KD significantly improves the explanations of the ViT model, thus maintaining the utility of the explanations of the CNN teacher model. While the explanations for KD change significantly under shift (subcol. 3), for e$^2$KD (subcol. 4), as with the CNN teacher (subcol. 2), the explanations remain consistent. See also \ref{supp:sec:qualitative:vit}.
  • Figure A1: In-distribution samples for distillation on biased data using the Waterbirds-100 dataset. We show explanations for both standard models (cols. 2-4) and B-cos models (cols. 5-7), given both in-distribution groups: 'Landbird on Land' (top half) and 'Waterbird on Water' (bottom half). We find that the e$^2$KD approach (cols. 4 and 7) is effective in preserving the teacher's focus (cols. 2 and 5) on the bird instead of the background, as opposed to vanilla KD (cols. 3 and 6). Correct and incorrect predictions are marked by ✓ and ✗, respectively.
  • ...and 5 more figures