Highlight Every Step: Knowledge Distillation via Collaborative Teaching

Haoran Zhao, Xin Sun, Junyu Dong, Changrui Chen, Zihe Dong

TL;DR

The paper addresses the challenge of deploying high-performing models on resource-constrained devices by improving knowledge distillation (KD). It introduces Collaborative Teaching Knowledge Distillation (CTKD), which trains a student under two teachers: a scratch teacher, trained from scratch alongside the student, whose temporary outputs supervise the student step by step along the path to the final logits, and a pre-trained expert teacher, which provides intermediate attention guidance. This dual supervision lets the student approach the final targets closely while focusing on salient regions. Experiments on CIFAR-10, CIFAR-100, SVHN, and Tiny ImageNet are reported to show CTKD outperforming standard KD and recent variants, which highlights the value of combining path-level and intermediate supervision in KD for model compression.

Abstract

High storage and computational costs hinder the deployment of deep neural networks on resource-constrained devices. Knowledge distillation aims to train a compact student network by transferring knowledge from a larger pre-trained teacher model. However, most existing knowledge distillation methods use only the teacher's final training results and ignore the valuable information contained in its training process. In this paper, we propose a new Collaborative Teaching Knowledge Distillation (CTKD) strategy that employs two special teachers. Specifically, one teacher trained from scratch (i.e., the scratch teacher) assists the student step by step using its temporary outputs. It forces the student to follow the optimal path towards the final logits with high accuracy. The other, pre-trained teacher (i.e., the expert teacher) guides the student to focus on the critical regions that are more useful for the task. Combining the knowledge from the two special teachers can significantly improve the performance of the student network in knowledge distillation. Experimental results on the CIFAR-10, CIFAR-100, SVHN and Tiny ImageNet datasets verify that the proposed knowledge distillation method is efficient and achieves state-of-the-art performance.

Paper Structure

This paper contains 13 sections, 4 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of our collaborative teaching knowledge distillation (CTKD) strategy. We illustrate the optimization process of the student network (green ball) under the collaborative guidance of the scratch teacher (red ball) and the expert teacher (black ball). The red and green lines represent the optimization paths of the scratch teacher and the student network, respectively. The expert teacher has already reached a local optimum. The student network starts its optimization process together with the scratch teacher and the expert teacher.
  • Figure 2: Illustration of the architecture. The scratch teacher is trained collaboratively with the student network from scratch. We use a standard cross-entropy loss for the scratch teacher network and for the student network so that each learns from the ground truth. In addition, the distillation loss supervises the training of the student network at every step. The expert teacher (pre-trained) guides the student network to focus on critical regions through intermediate-level attention maps.
  • Figure 3: Visualization of the top activation attention maps of WRN-16-1 (b) and WRN-40-1 (c). The deeper model focuses on more critical regions than the shallower one because of its greater capacity.
  • Figure 4: Structure of wide residual networks. (a) shows the basic residual block used in our base architecture. The widening factor m determines the network's width, and n is the number of bottlenecks in each group. (b) and (c) show a teacher-student pair, WRN-40-1 and WRN-16-1.
  • Figure 5: (a) Testing accuracy of the scratch teacher, the student trained with our knowledge distillation method, and the student trained individually. (b) Training loss and testing accuracy of different knowledge transfer methods on CIFAR-10.
  • ...and 2 more figures
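
The three supervision signals named in the Figure 2 caption (cross-entropy on the ground truth, a step-wise distillation loss from the scratch teacher, and attention guidance from the expert teacher) can be sketched for a single training example as follows. This is a minimal illustration, not the paper's implementation: the exact loss formulas are not given in this summary, so the sketch assumes a temperature-softened KL divergence for the distillation term and an L2 distance between normalized spatial attention maps for the attention term. The function names and the values of `alpha`, `beta` and `temperature` are hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_z for z in scaled]


def cross_entropy(logits, label):
    """Cross-entropy of the logits against an integer class label."""
    return -log_softmax(logits)[label]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Assumed form: T^2 * KL(teacher || student) on softened outputs."""
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    return temperature ** 2 * kl


def attention_map(feature):
    """Collapse a C x H x W feature (nested lists) into a flat, L2-normalized
    spatial map by summing squared activations over channels (assumed form)."""
    height, width = len(feature[0]), len(feature[0][0])
    flat = [sum(ch[i][j] ** 2 for ch in feature)
            for i in range(height) for j in range(width)]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0  # guard all-zero maps
    return [v / norm for v in flat]


def attention_loss(student_feature, expert_feature):
    """L2 distance between attention maps; spatial sizes are assumed equal,
    channel counts may differ."""
    a_s = attention_map(student_feature)
    a_e = attention_map(expert_feature)
    return math.sqrt(sum((s - e) ** 2 for s, e in zip(a_s, a_e)))


def student_loss(student_logits, scratch_logits, label,
                 student_feature, expert_feature,
                 alpha=0.9, beta=1.0, temperature=4.0):
    """Hypothetical weighted sum of the three terms for one example.

    scratch_logits are the scratch teacher's *current* outputs at this
    training step, not the outputs of a converged model.
    """
    ce = cross_entropy(student_logits, label)
    kd = distillation_loss(student_logits, scratch_logits, temperature)
    at = attention_loss(student_feature, expert_feature)
    return ce + alpha * kd + beta * at


# Usage on toy values: 3 classes, 2x2 feature maps.
student_feat = [[[0.1, 0.4], [0.2, 0.3]]]                             # 1 channel
expert_feat = [[[0.0, 0.9], [0.1, 0.2]], [[0.1, 0.8], [0.0, 0.1]]]    # 2 channels
loss = student_loss([1.0, 0.5, -0.2], [2.0, 0.1, -1.0], 0,
                    student_feat, expert_feat)
print(loss)
```

In this sketch the scratch teacher would be updated with its own cross-entropy loss in the same iteration, so the student sees a moving target that tracks the teacher's optimization path. The expert teacher's features come from a frozen pre-trained model.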