AI-KD: Adversarial learning and Implicit regularization for self-Knowledge Distillation

Hyungmin Kim, Sungho Suh, Sunghyun Baek, Daehwan Kim, Daun Jeong, Hansang Cho, Junmo Kim

Abstract

We present a novel adversarially penalized self-knowledge distillation method, adversarial learning and implicit regularization for self-knowledge distillation (AI-KD), which regularizes training through adversarial learning and implicit distillation. Our model distills deterministic and progressive knowledge, taken from the predictive probabilities of a pre-trained model and of the previous epoch respectively, and also transfers knowledge of the deterministic predictive distribution using adversarial learning. The motivation is that self-knowledge distillation methods regularize the predictive probabilities with soft targets, but the exact distributions may be hard to predict. Our method deploys a discriminator to distinguish the distributions of the pre-trained and student models, while the student model is trained to fool the discriminator during training. Thus, the student model not only learns the pre-trained model's predictive probabilities but also aligns its distribution with that of the pre-trained model. We demonstrate the effectiveness of the proposed method with several network architectures on multiple datasets and show that it achieves better performance than state-of-the-art methods.
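
The combination of objectives described in the abstract can be illustrated with a minimal per-sample sketch. It is not the paper's exact formulation: the function names, the loss weights `w_pre`, `w_prev`, and `w_adv`, the temperature value, and the use of a non-saturating GAN-style term `-log D(student)` are all assumptions made for illustration. The discriminator is represented only by its output score for the student's prediction.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def student_loss(student_logits, pretrained_logits, previous_logits, label,
                 disc_score_student, temperature=4.0,
                 w_pre=1.0, w_prev=1.0, w_adv=1.0):
    """Hypothetical per-sample student objective (all weights are assumptions).

    disc_score_student: discriminator output in (0, 1], read here as the
    probability that the student's prediction came from the pre-trained model.
    """
    # Supervised cross-entropy on the hard label.
    ce = -math.log(softmax(student_logits)[label] + 1e-12)
    s = softmax(student_logits, temperature)
    # "Deterministic" knowledge: soft targets from the fixed pre-trained model.
    kd_pre = kl_divergence(softmax(pretrained_logits, temperature), s)
    # "Progressive" knowledge: soft targets from the previous-epoch student.
    kd_prev = kl_divergence(softmax(previous_logits, temperature), s)
    # Adversarial term (assumed form): small when the discriminator is fooled.
    adv = -math.log(disc_score_student + 1e-12)
    return ce + w_pre * kd_pre + w_prev * kd_prev + w_adv * adv
```

In this sketch the two KL terms pull the student toward individual soft targets, while the adversarial term depends only on whether the discriminator can tell the student's outputs from the pre-trained model's, which is the distribution-level alignment the abstract describes.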

Paper Structure (19 sections, 16 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Concept comparison with conventional knowledge distillation methods. Each point denotes the latent vector of a different input sample. (a) The solid line indicates that conventional knowledge distillation is trained to minimize the distance between each latent vector point in the teacher network and the corresponding latent vector point in the student network. (b) The dashed line indicates that AI-KD is trained to minimize the distance between the distribution of the latent vectors in the teacher network and the distribution of the corresponding latent vectors in the student network.
  • Figure 2: The overall framework of AI-KD. Training a model with AI-KD is a two-phase process. In the first phase, the pre-trained model is trained from scratch; it serves as the baseline and is used as the superior student model in the next phase. In the second phase, the student model is trained via AI-KD. At inference, only the student model is used; the superior student model, the previous student model, and the discriminator are not needed.
  • Figure 3: The overall losses of AI-KD. The student model is trained with knowledge implicitly distilled from the superior pre-trained model and the previous student model. Through adversarial learning, the student is aligned with the distribution of the superior pre-trained model. The solid lines are feed-forward operations, and the dashed lines are backward operations that update the model from each loss.
  • Figure 4: Confidence reliability diagrams based on ResNet-18 with various datasets.
  • Figure 5: Confidence reliability diagrams based on DenseNet-121 with various datasets.
  • ...and 1 more figures