Wasserstein Distance Rivals Kullback-Leibler Divergence for Knowledge Distillation

Jiaming Lv, Haoyuan Yang, Peihua Li

TL;DR

The paper addresses limitations of KL-Div in knowledge distillation (KD) by introducing Wasserstein Distance-based KD (WKD), which enables cross-category comparison for logits and geometry-aware distribution matching for intermediate features. It presents WKD-L, a discrete WD formulation for logit distillation with target separation and inter-category relations modeled via centered kernel alignment (CKA), and WKD-F, a continuous WD approach that distills Gaussian feature distributions from intermediate layers. Across ImageNet, CIFAR-100, self-KD, and MS-COCO, WKD-L and WKD-F outperform strong KL-Div-based methods and state-of-the-art WD-based approaches, and combining the two yields the best results. The work demonstrates the practical viability of WD in distillation, highlights the importance of category interrelations and manifold geometry, and points to future directions in robust WD computation and distribution modeling for KD.
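To make the continuous-WD idea concrete, here is a minimal sketch of a distance between Gaussian feature distributions. It assumes diagonal covariances, for which the squared 2-Wasserstein distance between N(mu_t, diag(sigma_t^2)) and N(mu_s, diag(sigma_s^2)) reduces to ||mu_t - mu_s||^2 + ||sigma_t - sigma_s||^2. The function names, the feature shapes, and the diagonal-covariance choice are illustrative assumptions; the summary above does not say how WKD-F estimates or weights these statistics.

```python
import math
import random

def channel_stats(features):
    """Per-channel mean and std; `features` is a list of rows, one row per spatial position."""
    n = len(features)
    channels = list(zip(*features))
    mu = [sum(c) / n for c in channels]
    sigma = [math.sqrt(sum((x - m) ** 2 for x in c) / n) for c, m in zip(channels, mu)]
    return mu, sigma

def w2_squared_diag(mu_t, sigma_t, mu_s, sigma_s):
    """Squared 2-Wasserstein distance between two Gaussians with diagonal covariances."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_t, mu_s))
    cov_term = sum((a - b) ** 2 for a, b in zip(sigma_t, sigma_s))
    return mean_term + cov_term

# Hypothetical teacher/student feature maps: 196 positions, 4 channels each.
random.seed(0)
teacher = [[random.gauss(1.0, 2.0) for _ in range(4)] for _ in range(196)]
student = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(196)]

mu_t, sigma_t = channel_stats(teacher)
mu_s, sigma_s = channel_stats(student)
loss = w2_squared_diag(mu_t, sigma_t, mu_s, sigma_s)       # positive: the statistics differ
self_dist = w2_squared_diag(mu_t, sigma_t, mu_t, sigma_t)  # zero: identical distributions

# Two unit-variance Gaussians whose means are 5 apart: the distance stays finite
# and grows with the gap between the means, even when the densities barely overlap.
far_apart = w2_squared_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
```

The last line illustrates the geometry-aware behavior mentioned above: the value depends directly on how far apart the two distributions sit, rather than on how much their densities overlap.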

Abstract

Since the pioneering work of Hinton et al., knowledge distillation based on Kullback-Leibler Divergence (KL-Div) has been predominant, and recently its variants have achieved compelling performance. However, KL-Div only compares the probabilities of corresponding categories between the teacher and student, and lacks a mechanism for cross-category comparison. Moreover, KL-Div is problematic when applied to intermediate layers, as it cannot handle non-overlapping distributions and is unaware of the geometry of the underlying manifold. To address these downsides, we propose a methodology of Wasserstein Distance (WD) based knowledge distillation. Specifically, we propose a logit distillation method called WKD-L based on discrete WD, which performs cross-category comparison of probabilities and thus can explicitly leverage rich interrelations among categories. We also introduce a feature distillation method called WKD-F, which uses a parametric method for modeling feature distributions and adopts continuous WD for transferring knowledge from intermediate layers. Comprehensive evaluations on image classification and object detection show that (1) for logit distillation, WKD-L outperforms very strong KL-Div variants; and (2) for feature distillation, WKD-F is superior to the KL-Div counterparts and state-of-the-art competitors. The source code is available at https://peihuali.org/WKD
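The contrast between per-category and cross-category comparison can be sketched with a toy example. The code below compares KL-Div with a discrete WD approximated by Sinkhorn iterations (entropic regularization). The three categories, the cost matrix, the regularization strength `eps`, and the iteration count are all made up for illustration; the abstract does not specify the solver or how WKD-L builds its category costs.

```python
import math

def kl_div(p, q):
    """KL(p || q): a sum of per-category terms, each comparing p[i] only with q[i]."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sinkhorn_wd(p, q, cost, eps=0.05, n_iters=500):
    """Entropy-regularized approximation of the discrete Wasserstein distance."""
    n, m = len(p), len(q)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [p[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [q[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport cost <plan, cost> with plan[i][j] = u[i] * kernel[i][j] * v[j].
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

# Hypothetical categories: cat, dog, truck. Cat and dog are "close"; truck is far from both.
cost = [[0.0, 0.2, 1.0],
        [0.2, 0.0, 1.0],
        [1.0, 1.0, 0.0]]

teacher = [0.8, 0.1, 0.1]    # teacher predicts "cat"
student_a = [0.1, 0.8, 0.1]  # student predicts "dog" (a related category)
student_b = [0.1, 0.1, 0.8]  # student predicts "truck" (an unrelated category)

kl_a, kl_b = kl_div(teacher, student_a), kl_div(teacher, student_b)
wd_a = sinkhorn_wd(teacher, student_a, cost)
wd_b = sinkhorn_wd(teacher, student_b, cost)
```

KL-Div assigns both students the same penalty, because it never looks at which category received the misplaced mass. The WD penalty is smaller for `student_a`: without regularization, the optimal plans move 0.7 of the mass at cost 0.2 (total 0.14) and at cost 1.0 (total 0.70), respectively, and the Sinkhorn values approximate these.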

Paper Structure

This paper contains 36 sections, 17 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Our methodology of Wasserstein Distance (WD) based knowledge distillation. To effectively exploit rich category interrelations (a), we propose discrete WD based logit distillation (WKD-L) (b) that matches predicted distributions between the teacher and student. Besides, we introduce a feature distillation method based on continuous WD (WKD-F) (b), where we let the student mimic parametric feature distributions of the teacher. In (a), features of 100 categories are displayed by the corresponding images as per their 2D embeddings obtained by t-SNE; refer to the subsection "Details of IR Method" for details on this visualization.
  • Figure 2: KL-Div cannot perform cross-category comparison. Compare with WD in Figure \ref{figure:KL-KD} (left).
  • Figure 3: Diagrams of WCoRD/EMD+IPOT and NST/ICKD-C.
  • Figure 4: Visualization of interrelations among 100 categories in feature space. The categories exhibit complex topological relations, where features of the same category cluster and form a distribution that often overlaps with those of neighboring categories.
  • Figure 5: Analysis of hyper-parameters of WKD-L on ImageNet.
  • ...and 3 more figures