Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching

Wenqi Niu, Yingchao Wang, Guohui Cai, Hanpo Hou

TL;DR

This study argues that the student model should learn not only the probability values in the teacher's output but also the relative ranking of classes. It proposes Correlation Matching Knowledge Distillation (CMKD), a method that combines KD losses based on the Pearson and Spearman correlation coefficients to achieve more efficient and robust distillation from a stronger teacher model.

Abstract

Knowledge Distillation (KD) has emerged as a pivotal technique for neural network compression and performance enhancement. Most KD methods transfer dark knowledge from a cumbersome teacher model to a lightweight student model using a Kullback-Leibler (KL) divergence loss. However, the student performance improvements achieved through KD exhibit diminishing marginal returns: a stronger teacher model does not necessarily lead to a proportionally stronger student model. Investigating this issue, we find empirically that the KL-based KD method may implicitly change the inter-class relationships learned by the student model, resulting in a more complex and ambiguous decision boundary, which in turn reduces the model's accuracy and generalization ability. Therefore, this study argues that the student model should learn not only the probability values in the teacher's output but also the relative ranking of classes, and proposes a novel Correlation Matching Knowledge Distillation (CMKD) method that combines KD losses based on the Pearson and Spearman correlation coefficients to achieve more efficient and robust distillation from a stronger teacher model. Moreover, because samples vary in difficulty, CMKD dynamically adjusts the weights of the Pearson-based loss and the Spearman-based loss. CMKD is simple yet practical, and extensive experiments demonstrate that it consistently achieves state-of-the-art performance on CIFAR-100 and ImageNet and adapts well to various teacher architectures, teacher sizes, and other KD methods.
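
To make the distinction between matching probability values and matching class rankings concrete, the following is a minimal sketch rather than the authors' implementation. It computes the Pearson coefficient (linear agreement of values) and the Spearman coefficient (agreement of ranks) between a teacher and a student output, then mixes the two into a single loss. The function names, the `1 - correlation` form of each loss term, and the fixed weight `alpha` are assumptions made for illustration; the abstract states only that CMKD adjusts the two weights dynamically and does not give the rule. The hard ranking used here is also not differentiable, so a trainable version would need a smooth ranking surrogate.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_matching_loss(teacher, student, alpha=0.5):
    """Hypothetical combined loss; alpha is a fixed placeholder weight,
    standing in for the dynamic weighting that the abstract mentions."""
    pearson_loss = 1.0 - pearson(teacher, student)
    spearman_loss = 1.0 - spearman(teacher, student)
    return alpha * pearson_loss + (1.0 - alpha) * spearman_loss


# Illustrative class probabilities (made up, not taken from the paper):
# the student swaps the order of the second and third classes.
teacher_probs = [0.60, 0.25, 0.10, 0.05]
student_probs = [0.60, 0.10, 0.25, 0.05]

print("Pearson r   :", pearson(teacher_probs, student_probs))
print("Spearman rho:", spearman(teacher_probs, student_probs))
print("Combined    :", correlation_matching_loss(teacher_probs, student_probs))
```

In this toy example the student assigns the same set of probability values as the teacher but attaches two of them to the wrong classes, so both coefficients fall below 1. The Spearman term responds only to the change in ordering, which is the kind of ranking information the abstract argues a student should also learn.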

Paper Structure

This paper contains 38 sections, 24 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: The accuracy of the teacher model and the student model (ResNet14), which are all trained on the clean CIFAR-100 dataset. (a) illustrates the testing Top-1 accuracy on the clean CIFAR-100 dataset, while (b) displays the testing accuracy on the noisy CIFAR-100 dataset with elastic transformations.
  • Figure 2: The confusion matrix between the logits of the teacher model and the student model (ResNet14). The first row shows the confusion matrices on the CIFAR-10 dataset, while the second row displays the confusion matrices on the CIFAR-100 dataset.
  • Figure 3: Spearman and Pearson correlation coefficients between the teacher model output and the student model output during knowledge distillation. The training dataset for (a) and (b) is CIFAR-10, while the training dataset for (c) and (d) is CIFAR-100.
  • Figure 4: An example of implicitly altering the rank relationship of the student model's output through the KL-based KD method, where $r$ is the Pearson correlation coefficient and $\rho$ is the Spearman correlation coefficient.
  • Figure 5: t-SNE visualization of the same student model distilled from different teachers. The first row shows the results on the CIFAR-10 dataset, and the second row shows the results on the CIFAR-100 dataset.
  • ...and 5 more figures