Adaptive Teaching with Shared Classifier for Knowledge Distillation

Jaeyeon Jang, Young-Ik Kim, Jisu Lim, Hyeonseong Lee

Abstract

Knowledge distillation (KD) is a technique used to transfer knowledge from an overparameterized teacher network to a less-parameterized student network, thereby minimizing the incurred performance loss. KD methods can be categorized into offline and online approaches. Offline KD leverages a powerful pretrained teacher network, while online KD allows the teacher network to be adjusted dynamically to enhance the learning effectiveness of the student network. Recently, it has been discovered that sharing the classifier of the teacher network can significantly boost the performance of the student network with only a minimal increase in the number of network parameters. Building on these insights, we propose adaptive teaching with a shared classifier (ATSC). In ATSC, the pretrained teacher network self-adjusts to better align with the learning needs of the student network based on its capabilities, and the student network benefits from the shared classifier, enhancing its performance. Additionally, we extend ATSC to environments with multiple teachers. We conduct extensive experiments, demonstrating the effectiveness of the proposed KD method. Our approach achieves state-of-the-art results on the CIFAR-100 and ImageNet datasets in both single-teacher and multiteacher scenarios, with only a modest increase in the number of required model parameters. The source code is publicly available at https://github.com/random2314235/ATSC.

Paper Structure (19 sections, 7 equations, 4 figures, 15 tables, 1 algorithm)

Figures (4)

  • Figure 1: Illustrative comparison among different KD methods. Clf. and Proj. denote a classifier and a projector, respectively. The main differences among these methods include their loss definitions, the flow of gradients, and teacher roles during the learning process. (a) In vanilla KD, gradients are derived from two losses: a loss comparing the last-layer logits of the pretrained teacher and the student and a prediction loss. (b) Feature distillation extends beyond vanilla KD by also extracting gradient information from the intermediate layers of the encoder. (c) In general, online KD is a dynamic form of feature distillation in which both the teacher and the student alternately apply distillation techniques to each other. (d) In SimKD, the student is trained to map its representations to those produced by the encoder of the pretrained teacher; this step is facilitated by an additional projector. This method also involves sharing the classifier of the large teacher network with the smaller student network to maintain high discriminative capabilities. (e) Our proposed ATSC approach enables the teacher to not only guide the student but also adaptively fine-tune its encoder parameters to better support the learning procedure of the student. Furthermore, the classifier is optimized to consider the updated encoder of the teacher, ensuring a more effective and integrated learning process.
  • Figure 2: An overview of the proposed ATSC approach. In (a), $\mathcal{L}_{MSE}(E_T(\bm{x}), \mathcal{P}(E_S(\bm{x})))$ represents the MSE loss between the representations derived from the teacher and student models, and $\mathcal{L}_{MSE}(\bm{\theta}^*_{E_T}, \bm{\theta}_{E_T})$ denotes the penalty imposed on the parameter changes exhibited by the encoder of the teacher. In (b), $\mathcal{L}_{CE}(\bm{y}, \sigma(C(E_T(\bm{x}))))$ denotes the cross-entropy loss.
  • Figure 3: Top-1 mean accuracies (with standard deviations) achieved over 4 separate trials under different values of the balancing parameter $\alpha$.
  • Figure 4: Top-1 test accuracy (%) changes exhibited after adapting pretrained teachers. 'Teacher' denotes the performance of the pretrained teacher. We report the average accuracy ($\pm$ standard deviation) of the teacher model after training through ATSC over 4 trials.
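
The loss terms named in the Figure 2 caption can be illustrated with a small, dependency-free sketch. This is not the authors' implementation (see the linked repository for that). The function names are hypothetical, vectors are plain Python lists standing in for tensors, and combining the two MSE terms as a convex combination weighted by $\alpha$ is an assumption made here for illustration, since the captions only describe $\alpha$ as a balancing parameter.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    assert len(a) == len(b), "vectors must have the same length"
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def softmax(logits):
    """Numerically stable softmax (the sigma in the Figure 2 caption)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs):
    """Cross-entropy between a one-hot label y and predicted probabilities."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)


def stage_a_loss(teacher_feat, projected_student_feat,
                 theta_pretrained, theta_current, alpha):
    """Illustrative loss for panel (a) of Figure 2.

    feature term: L_MSE(E_T(x), P(E_S(x)))
    penalty term: L_MSE(theta*_{E_T}, theta_{E_T})
    ASSUMPTION: the terms are mixed as alpha * feature + (1 - alpha) * penalty;
    the exact weighting used in the paper is not stated in the captions.
    """
    feature_term = mse(teacher_feat, projected_student_feat)
    penalty_term = mse(theta_pretrained, theta_current)
    return alpha * feature_term + (1.0 - alpha) * penalty_term


def stage_b_loss(one_hot_label, classifier_logits):
    """Illustrative loss for panel (b): L_CE(y, sigma(C(E_T(x))))."""
    return cross_entropy(one_hot_label, softmax(classifier_logits))


if __name__ == "__main__":
    # Toy values only; real inputs would be encoder outputs and parameters.
    print(stage_a_loss([1.0, 2.0], [1.0, 4.0], [0.0, 0.0], [1.0, 1.0], alpha=0.5))
    print(stage_b_loss([0, 1], [0.0, 0.0]))
```

Under this assumed weighting, a larger $\alpha$ emphasizes matching the teacher and projected student representations, while a smaller $\alpha$ keeps the teacher encoder closer to its pretrained parameters.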