Adaptive Explicit Knowledge Transfer for Knowledge Distillation

Hyungkeun Park, Jong-Seok Lee

TL;DR

This paper addresses the performance gap between logit-based and feature-based knowledge distillation (KD). Through gradient analysis of KD and decoupled KD (DKD), it shows that delivering the teacher's non-target class distribution (implicit knowledge) adaptively controls how that knowledge is learned. It then proposes Adaptive Explicit Knowledge Transfer (AEKT), a loss that adaptively weights learning of the teacher's target-class confidence based on the teacher/student ratio, and introduces a task-serialization FC layer that decouples classification from distillation. The authors derive AEKT's gradient and report improvements over state-of-the-art KD methods on CIFAR-100 and ImageNet, supported by ablations and an analysis of the inter-class relations learned by the serialization layer. The method targets compact student models for resource-constrained devices while preserving the inter-class structure learned by the teacher.

Abstract

Logit-based knowledge distillation (KD) for classification is cost-efficient compared to feature-based KD but often suffers from inferior performance. Recently, it was shown that the performance of logit-based KD can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model, which is known as 'implicit (dark) knowledge', to the student model. Through gradient analysis, we first show that this has the effect of adaptively controlling the learning of implicit knowledge. Then, we propose a new loss that enables the student to learn explicit knowledge (i.e., the teacher's confidence about the target class) along with implicit knowledge in an adaptive manner. Furthermore, we propose to separate the classification and distillation tasks for effective distillation and inter-class relationship modeling. Experimental results demonstrate that the proposed method, called the adaptive explicit knowledge transfer (AEKT) method, achieves improved performance compared to the state-of-the-art KD methods on the CIFAR-100 and ImageNet datasets.
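
The distinction between explicit knowledge (the teacher's confidence about the target class) and implicit knowledge (the distribution over non-target classes) can be illustrated with the decomposition behind the TCKD and NCKD terms listed under Figure 2. The sketch below is not the AEKT loss. It only checks numerically that the classical KD loss, $\mathrm{KL}(p^T \,\|\, p^S)$, equals $\mathrm{TCKD} + (1 - p_t^T)\,\mathrm{NCKD}$. The logits, the target index, and all function names are made up for illustration, and the temperature is fixed to 1.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    # KL divergence KL(p || q) for two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def decompose(p_teacher, p_student, target):
    pt_T, pt_S = p_teacher[target], p_student[target]
    # Target-class term: binary KL between [p_t, 1 - p_t] of teacher and student.
    tckd = kl([pt_T, 1 - pt_T], [pt_S, 1 - pt_S])
    # Non-target term: KL between the renormalized non-target distributions.
    nt_T = [p / (1 - pt_T) for i, p in enumerate(p_teacher) if i != target]
    nt_S = [p / (1 - pt_S) for i, p in enumerate(p_student) if i != target]
    nckd = kl(nt_T, nt_S)
    return tckd, nckd, pt_T

# Hypothetical logits for a 4-class problem; class 0 is the target.
teacher_logits = [4.0, 1.0, 0.5, -1.0]
student_logits = [2.0, 1.5, 0.0, -0.5]
target = 0

p_T = softmax(teacher_logits)
p_S = softmax(student_logits)
tckd, nckd, pt_T = decompose(p_T, p_S, target)
kd = kl(p_T, p_S)

# The two printed values should agree up to floating-point error.
print(kd, tckd + (1 - pt_T) * nckd)
```

The factor $(1 - p_t^T)$ shows why a confident teacher suppresses the non-target term in plain KD: the larger the teacher's target-class probability, the smaller the weight on the implicit knowledge.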

Paper Structure

This paper contains 32 sections, 52 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Task serialization with an additional FC layer.
  • Figure 2: Overall architecture of the proposed AEKT method. Three losses for distillation, $\mathcal{L}_{TCKD}$, $\mathcal{L}_{NCKD}$, and $\mathcal{L}_{AEKT}$, are measured between the output probabilities of the teacher ($p_i^T$) and those of the student ($p_i^S$). An FC layer is used to serialize the distillation task and the classification task. The cross-entropy (CE) loss is measured before the FC layer. The FC layer is used only for training of the student; inference is performed using the logits of the student model.
  • Figure 3: Visualization of the learned weights of the FC layer for task serialization. $w_{ij}$ is shown at the intersection of the $i$th row and the $j$th column.
  • Figure 4: Examples of the FC layer weights for ImageNet. The 20 classes showing the largest absolute weight values are shown. Left: 'tiger shark' class. Right: 'Italian greyhound' class.
  • Figure B.1: Accuracy comparison with and without the $1-p_t^S$ term in the gradient backpropagated to the student model's target logit.
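
The task serialization described in the captions of Figures 1 and 2 can be sketched as a forward pass. This is a minimal illustration under stated assumptions, not the authors' implementation: the FC layer is assumed to map the student's class logits to the same number of outputs, the distillation loss is assumed to be computed on the softmax of the FC output, and a plain KL divergence stands in for the sum of $\mathcal{L}_{TCKD}$, $\mathcal{L}_{NCKD}$, and $\mathcal{L}_{AEKT}$, whose exact forms are not given here. All function names, logits, and weights are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    # KL divergence KL(p || q) for two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def fc(weights, bias, logits):
    # Fully connected layer: one output per row of the weight matrix.
    return [sum(w * z for w, z in zip(row, logits)) + b
            for row, b in zip(weights, bias)]

def training_losses(student_logits, teacher_logits, target, weights, bias):
    # Classification: cross-entropy on the student logits, before the FC layer.
    ce = -math.log(softmax(student_logits)[target])
    # Distillation: student probabilities taken after the FC layer (assumption).
    p_student = softmax(fc(weights, bias, student_logits))
    p_teacher = softmax(teacher_logits)
    # Plain KL used as a stand-in for the paper's distillation losses.
    distill = kl(p_teacher, p_student)
    return ce, distill

def predict(student_logits):
    # Inference bypasses the FC layer and uses the student logits directly.
    return max(range(len(student_logits)), key=lambda i: student_logits[i])

# Hypothetical 3-class example with an identity FC layer and zero bias.
student_logits = [2.0, 1.5, 0.0]
teacher_logits = [4.0, 1.0, 0.5]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero_bias = [0.0, 0.0, 0.0]

ce, distill = training_losses(student_logits, teacher_logits, 0, identity, zero_bias)
print(ce, distill, predict(student_logits))
```

With an identity FC layer the distillation term reduces to ordinary KD on the student logits. A learned weight matrix lets the distillation head mix class logits, and those mixing weights are what Figures 3 and 4 visualize.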