Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information

Linfeng Ye, Shayan Mohajer Hamidi, Renhao Tan, En-Hui Yang

TL;DR

Employing a teacher trained via maximum CMI (MCMI) estimation rather than maximum log-likelihood (MLL) estimation in various state-of-the-art knowledge distillation (KD) frameworks consistently increases the student's classification accuracy, suggesting that the MCMI teacher provides a more accurate estimate of the Bayes conditional probability distribution (BCPD) than the MLL teacher.

Abstract

It is believed that in knowledge distillation (KD), the role of the teacher is to provide an estimate of the unknown Bayes conditional probability distribution (BCPD) to be used in the student training process. Conventionally, this estimate is obtained by training the teacher using the maximum log-likelihood (MLL) method. To improve this estimate for KD, in this paper we introduce the concept of conditional mutual information (CMI) into the estimation of BCPD and propose a novel estimator called the maximum CMI (MCMI) method. Specifically, in MCMI estimation, both the log-likelihood and the CMI of the teacher are simultaneously maximized when the teacher is trained. Through Eigen-CAM, it is further shown that maximizing the teacher's CMI value allows the teacher to capture more contextual information in an image cluster. Through a thorough set of experiments, we show that employing a teacher trained via MCMI estimation rather than one trained via MLL estimation in various state-of-the-art KD frameworks consistently increases the student's classification accuracy, with gains of up to 3.32%. This suggests that the teacher's BCPD estimate provided by the MCMI method is more accurate than that provided by the MLL method. In addition, we show that such improvements in the student's accuracy are more drastic in zero-shot and few-shot settings. Notably, the student's accuracy increases by up to 5.72% when 5% of the training samples are available to the student (few-shot), and increases from 0% to as high as 84% for an omitted class (zero-shot). The code is available at https://github.com/iclr2024mcmi/ICLRMCMI.
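
The abstract says that MCMI training maximizes the teacher's log-likelihood and its CMI together, but this summary does not give the formulas. The following is a minimal NumPy sketch under stated assumptions: the CMI is estimated as the average KL divergence between each sample's output distribution and the mean output distribution (centroid) of its class, and the two terms are combined with a weight `lam`. The function names, the estimator, and `lam` are illustrative assumptions, not the authors' implementation; the linked repository is the authoritative reference.

```python
import numpy as np

def empirical_cmi(probs, labels, eps=1e-12):
    """Estimate I(X; Y_hat | Y) from teacher outputs (assumed estimator).

    probs:  (N, C) array; each row is a teacher softmax output for one sample.
    labels: (N,) integer array of ground-truth classes.
    """
    probs = np.clip(probs, eps, 1.0)       # avoid log(0)
    total = 0.0
    for y in np.unique(labels):
        cluster = probs[labels == y]       # outputs of all samples in class y
        centroid = cluster.mean(axis=0)    # class-conditional mean output
        # KL(P_x || centroid) for every sample x in the class
        kl = np.sum(cluster * (np.log(cluster) - np.log(centroid)), axis=1)
        total += kl.sum()
    return total / len(labels)

def mcmi_objective(probs, labels, lam=0.1, eps=1e-12):
    """Loss to minimize: cross-entropy minus lam * CMI.

    Minimizing it raises the log-likelihood and the CMI at the same time.
    lam is a hypothetical trade-off weight, not a value from the paper.
    """
    clipped = np.clip(probs, eps, 1.0)
    ce = -np.mean(np.log(clipped[np.arange(len(labels)), labels]))
    return ce - lam * empirical_cmi(probs, labels, eps)

# Toy usage: two classes, two samples each.
labels = np.array([0, 0, 1, 1])
collapsed = np.array([[0.7, 0.3], [0.7, 0.3], [0.2, 0.8], [0.2, 0.8]])
spread = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.4, 0.6]])
cmi_collapsed = empirical_cmi(collapsed, labels)
cmi_spread = empirical_cmi(spread, labels)
```

Under this estimator, a teacher whose outputs are identical within each class has a CMI of zero, while outputs that vary across samples of the same class give a positive value. In a real training loop the same quantities would be computed on differentiable tensors so that gradients can flow through both terms.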

Paper Structure (44 sections, 1 theorem, 16 equations, 19 figures, 13 tables)

This paper contains 44 sections, 1 theorem, 16 equations, 19 figures, 13 tables.

Key Result

Proposition 1

If $\boldsymbol{f}_{x}$ is an intermediate feature map of a DNN corresponding to the input $x$, then $I(X;\hat{Y} \mid Y) \leq I(X;\boldsymbol{f}_{X} \mid Y)$. See the paper's appendix for the proof.
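
One plausible way to see why Proposition 1 holds is the data processing inequality. The sketch below assumes that the feature map is a deterministic function of the input and that the prediction $\hat{Y}$ depends on $X$ only through $\boldsymbol{f}_{X}$. These assumptions are not stated in this summary, and the appendix proof in the paper may proceed differently.

```latex
% Assumed Markov chain (not stated in the summary): the label influences the
% prediction only through the input, and the input only through the features.
Y \;\to\; X \;\to\; \boldsymbol{f}_{X} \;\to\; \hat{Y}

% Conditioned on Y = y, X -> f_X -> \hat{Y} is still a Markov chain, so the
% data processing inequality gives, for every class y,
I(X;\hat{Y} \mid Y = y) \;\leq\; I(X;\boldsymbol{f}_{X} \mid Y = y).

% Averaging over y with weights P_Y(y) yields the stated bound:
I(X;\hat{Y} \mid Y)
  = \sum_{y} P_Y(y)\, I(X;\hat{Y} \mid Y = y)
  \;\leq\; \sum_{y} P_Y(y)\, I(X;\boldsymbol{f}_{X} \mid Y = y)
  = I(X;\boldsymbol{f}_{X} \mid Y).
```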

Figures (19)

  • Figure 1: The teacher's CMI value (red curve, right axis) and the student's accuracy (blue bars, left axis) vs. the teacher's temperature in conventional KD for three different teacher-student pairs.
  • Figure 2: The evolution of CMI and LL values for a teacher trained by MLL during the training.
  • Figure 3: Eigen-CAM for MLL and MCMI teachers for 4 samples from class "Toy Terrier".
  • Figure 4: The student's confusion matrices when it is trained with the MLL teacher (left) and with the MCMI teacher (right).
  • Figure 5: The effect of the teacher's size on (i) the student's accuracy, (ii) the teacher's CMI, and (iii) the teacher's LL for both MLL and MCMI teachers. The student model is ResNet8, and the dataset is CIFAR-100.
  • ...and 14 more figures

Theorems & Definitions (3)

  • Remark
  • Proposition 1
  • Proof