Knowledge Distillation in Generations: More Tolerant Teachers Educate Better Students

Chenglin Yang, Lingxi Xie, Siyuan Qiao, Alan Yuille

Abstract

We focus on the problem of training a deep neural network in generations. The procedure is as follows: to optimize the target network (student), another network (teacher) with the same architecture is trained first and then used to provide part of the supervision signal in the next stage. Although this strategy leads to higher accuracy, many aspects of it (e.g., why teacher-student optimization helps) remain poorly understood. This paper studies the problem from the perspective of controlling the strictness with which the teacher network is trained. Existing approaches mostly use a hard distribution (e.g., one-hot vectors) in training, which produces a strict teacher that is itself highly accurate. We argue that the teacher needs to be more tolerant, even though this often implies lower teacher accuracy. The implementation is simple: a single extra loss term is added when training the teacher network, encouraging a few secondary classes to emerge and complement the primary class. Consequently, the teacher provides a milder supervision signal (a less peaked distribution), which allows the student to learn from inter-class similarity and potentially lowers the risk of over-fitting. Experiments are performed on standard image classification tasks (CIFAR100 and ILSVRC2012). Although the teacher network is less accurate, the students improve consistently across generations and eventually achieve higher classification accuracy than competing approaches. Model ensemble and transfer feature extraction experiments further verify the effectiveness of our approach.
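The abstract does not spell out the extra loss term, so the following is only a minimal sketch of one plausible form, written in plain Python. It assumes the term penalizes the gap between the highest confidence score and the mean of the next `k` scores, added to the usual cross-entropy with weight `eta`. The function names, `eta`, and `k` are hypothetical and chosen for illustration; they are not taken from the text above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_gap(probs, k):
    """Top confidence minus the mean of the next k confidences.

    Assumed form of the 'tolerance' term: it is large for a peaked
    distribution and shrinks as secondary classes gain probability mass.
    """
    if not 1 <= k < len(probs):
        raise ValueError("k must satisfy 1 <= k < number of classes")
    ranked = sorted(probs, reverse=True)
    return ranked[0] - sum(ranked[1:k + 1]) / k


def tolerant_teacher_loss(logits, label, eta=0.5, k=3):
    """Cross-entropy on the true label plus eta times the confidence gap.

    eta and k are hypothetical hyperparameters for this sketch.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[label])
    return cross_entropy + eta * confidence_gap(probs, k)
```

With `eta=0` this reduces to ordinary cross-entropy training, i.e. the strict teacher. Increasing `eta` trades some teacher confidence for a flatter output distribution, which is the less peaked supervision signal the abstract describes.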

Paper Structure

This paper contains 13 sections, 4 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: The $100\times100$ confusion matrices produced by the patriarch and the first five students of a born-again process, training a $110$-layer ResNet on CIFAR100. See Table \ref{Tab:TopConfidenceScores} for quantitative results. Rows indicate the ground-truth class, and columns indicate the class with the second-highest confidence score. The closer a cell's color is to yellow, the larger the corresponding value.
  • Figure 2: Classification accuracy ($\%$) on CIFAR100, produced by different training-in-generation processes. The baseline approach (single generation) corresponds to $\mathfrak{D}\!\left(1.0,0.0\right)$, while $\mathfrak{D}\!\left(1.0,0.5\right)$ and $\mathfrak{D}\!\left(1.0,0.6\right)$ are born-again networks. LSR-$0.6$ and CP-$0.6$ denote processes in which the patriarch model is replaced by one trained with label smoothing regularization and confidence penalty, respectively, with ${\lambda}={0.6}$ used in the subsequent generations. All three plots share the same legend (shown in the first plot).