Can Students Beyond The Teacher? Distilling Knowledge from Teacher's Bias

Jianhua Zhang, Yi Gao, Ruyu Liu, Xu Cheng, Houxiang Zhang, Shengyong Chen

TL;DR

Knowledge distillation is often limited by teacher bias, where the teacher's distribution $P^T$ contains biased components such as $P^T_{wrong}$ that mislead the student. The authors propose a bias-aware KD framework comprising bias elimination to discard biased knowledge, bias rectification to adjust wrong predictions toward the true labels, and a dynamic loss with $\

Abstract

Knowledge distillation (KD) is a model compression technique that transfers knowledge from a large teacher model to a smaller student model to enhance its performance. Existing methods often assume that the student model is inherently inferior to the teacher model. However, we identify that the fundamental issue affecting student performance is the bias transferred by the teacher. Current KD frameworks transmit both right and wrong knowledge, introducing bias that misleads the student model. To address this issue, we propose a novel strategy to rectify bias and greatly improve the student model's performance. Our strategy involves three steps: First, we differentiate knowledge and design a bias elimination method to filter out biases, retaining only the right knowledge for the student model to learn. Next, we propose a bias rectification method to rectify the teacher model's wrong predictions, fundamentally addressing bias interference. The student model learns from both the right knowledge and the rectified biases, greatly improving its prediction accuracy. Additionally, we introduce a dynamic learning approach with a loss function that updates weights dynamically, allowing the student model to quickly learn right knowledge-based easy tasks initially and tackle hard tasks corresponding to biases later, greatly enhancing the student model's learning efficiency. To the best of our knowledge, this is the first strategy enabling the student model to surpass the teacher model. Experiments demonstrate that our strategy, as a plug-and-play module, is versatile across various mainstream KD frameworks. We will release our code after the paper is accepted.
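The three steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the formulas, so the swap-based `rectify`, the linear `dynamic_weight` schedule, and the function `bias_aware_kd_loss` are all hypothetical stand-ins. Samples the teacher classifies correctly contribute an ordinary KD term (bias elimination keeps only these), samples the teacher gets wrong contribute a KD term against a rectified distribution, and the weight on the latter grows over training.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    # Index of the largest entry (first one on ties).
    return max(range(len(values)), key=values.__getitem__)

def rectify(teacher_probs, label):
    # ASSUMPTION: rectification is modelled as swapping the teacher's top
    # probability with the probability of the true class, so the argmax
    # agrees with the label. The paper's actual rule may differ.
    fixed = list(teacher_probs)
    top = argmax(fixed)
    fixed[top], fixed[label] = fixed[label], fixed[top]
    return fixed

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def dynamic_weight(epoch, total_epochs):
    # ASSUMPTION: a linear ramp from 0 to 1 over training, so hard
    # (teacher-wrong) samples matter more later on.
    return min(1.0, max(0.0, epoch / max(1, total_epochs - 1)))

def bias_aware_kd_loss(teacher_logits, student_logits, labels,
                       epoch, total_epochs, temperature=4.0):
    easy_loss, hard_loss = 0.0, 0.0
    for t_logits, s_logits, y in zip(teacher_logits, student_logits, labels):
        p_t = softmax(t_logits, temperature)
        p_s = softmax(s_logits, temperature)
        if argmax(p_t) == y:
            # Right knowledge: distill the teacher distribution as-is.
            easy_loss += kl_divergence(p_t, p_s)
        else:
            # Biased knowledge: distill the rectified distribution instead.
            hard_loss += kl_divergence(rectify(p_t, y), p_s)
    w = dynamic_weight(epoch, total_epochs)
    return (easy_loss + w * hard_loss) / len(labels)
```

With this schedule, the weight on teacher-wrong samples is zero at the first epoch, so the loss reduces to bias elimination alone; the rectified term is phased in as training proceeds.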

Paper Structure

This paper contains 12 sections, 9 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Diagram of the knowledge transfer process in KD. (a) The KD framework based on a teacher-student network model. (b) Right knowledge transfer helps the student model learn effectively. (c) Bias knowledge transfer can mislead the student model with biased knowledge.
  • Figure 2: Overview of the proposed framework. Module ① is a KD framework based on the teacher-student model. Module ② is the bias elimination method. Module ③ is the bias rectification method. Module ④ is the dynamic learning approach.
  • Figure 3: Diagram of the rectification method. Each column corresponds to a category; $Label$ is the one-hot ground-truth vector, $P^T_{wrong}$ is the probability the teacher predicts for each category, and $P^T_{re}$ is the probability after rectification. (a) shows the teacher's wrong prediction, (b) shows that the predicted probability is consistent with $Label$ after rectification, and (c) shows a further adjustment of the rectified probabilities.
  • Figure 4: The accuracy variation curves during the training process for several KD methods. The x-axis is training epochs, and the y-axis is the accuracy of the ResNet-34 model on the CIFAR-100 validation set.