Generalizing Teacher Networks for Effective Knowledge Distillation Across Student Architectures

Kuluhan Binici; Weiming Wu; Tulika Mitra

TL;DR

This work tackles the capacity gap between teacher and student networks in knowledge distillation (KD) by introducing the Generic Teacher Network (GTN), a one-off training procedure that yields a KD-aware generic teacher usable across a finite pool of student architectures. GTN grounds the teacher in the capacities of multiple reference students sampled from a weight-sharing supernet, using path sampling and a trainable distribution to expose the teacher to diverse student configurations without retraining it for each teacher-student pair. The method combines a conditioning loss with a dedicated architecture loss to regularize the teacher toward the combined capabilities of the student pool, and then uses a standard KD phase to transfer knowledge to any target student. Empirical results on CIFAR-100 and ImageNet-200 show that GTN improves KD performance for both randomly sampled and NAS-derived students. Its training overhead is constant and comparable to training a few specialized teachers, which makes deployment across heterogeneous hardware platforms scalable.
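The path-sampling step can be illustrated with a short sketch. This is a minimal illustration and not the authors' implementation: the per-block logits `arch_logits`, the helper `sample_path`, and the 3-block, 4-candidate supernet are all hypothetical, and the sketch simply assumes that the trainable distribution is a softmax over per-block logits.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_path(arch_logits, rng):
    """Pick one candidate operation per supernet block.

    arch_logits: one list of logits per block (assumed parameterization).
    Returns a list of chosen candidate indices, i.e. one student path.
    """
    path = []
    for logits in arch_logits:
        probs = softmax(logits)
        choice = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        path.append(choice)
    return path


# Hypothetical supernet: 3 blocks, each with 4 candidate operations.
# All-zero logits give a uniform distribution over candidates.
arch_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
rng = random.Random(0)

# Each sampled path stands for one reference student shown to the teacher.
paths = [sample_path(arch_logits, rng) for _ in range(5)]
```

In a training loop, updating the logits would shift which student configurations are sampled most often; how GTN actually updates its distribution is not specified in this summary.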

Abstract

Knowledge distillation (KD) is a model compression method in which a compact student model is trained to emulate the performance of a more complex teacher model. However, the gap in architectural capacity between the two models limits the effectiveness of knowledge transfer. To address this issue, previous works have focused on customizing teacher-student pairs to improve compatibility, a computationally expensive process that must be repeated every time either model changes. These methods are therefore impractical when a teacher model has to be compressed into different student models for deployment on multiple hardware devices with distinct resource constraints. In this work, we propose Generic Teacher Network (GTN), a one-off KD-aware training procedure that creates a generic teacher capable of effectively transferring knowledge to any student model sampled from a given finite pool of architectures. To this end, we represent the student pool as a weight-sharing supernet and condition our generic teacher to align with the capacities of various student architectures sampled from this supernet. Experimental evaluation shows that our method both improves overall KD effectiveness and amortizes the minimal additional training cost of the generic teacher across students in the pool.
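For readers unfamiliar with KD, the widely used temperature-scaled formulation can be sketched as follows. This is an assumption about what "standard KD" refers to, since the abstract does not state the paper's exact loss; the function name `kd_loss` and the defaults `temperature=4.0` and `alpha=0.5` are illustrative choices, not values taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-label KL divergence.

    The KL term is scaled by temperature**2, as is common practice, so its
    gradient magnitude stays comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    # Cross-entropy of the student against the ground-truth class index.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl


# Illustrative logits for a single 3-class example (made-up numbers).
loss = kd_loss([0.5, 1.0, 2.0], [0.1, 0.9, 3.0], label=2)
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the cross-entropy term remains.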

Paper Structure (15 sections, 6 equations, 3 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Knowledge Distillation (KD)
Teacher-Student Capacity Gap in KD
Neural Architecture Search & Supernet Architectures
Method
Conditioning the Teacher Based on the Capacity of a Reference Student Architecture
GTN Training Using a Supernet as Reference Architecture
Knowledge Distillation
Experimental Evaluation
Implementation Details
KD with random students
KD with students obtained by NAS
Conclusion
Acknowledgement

Figures (3)

Figure 1: Illustration of the capacity gap problem in KD and the motivation behind our proposed generic teacher approach.
Figure 2: Overview of our GTN framework. (a) Teacher model is regularized based on the capacity of a reference student. (b) Supernet blocks allow the architecture of the student branches to be reconfigured, exposing the teacher to various reference students for regularization. (c) Static student blocks used to train specialised teachers.
Figure 3: Training time comparison for ResNet32, WRN40-2 and EfficientNet-b0 teachers, respectively. Red dashed vertical lines mark the number of students beyond which our GTN method attains a training-time advantage.
