Generalizing Teacher Networks for Effective Knowledge Distillation Across Student Architectures
Kuluhan Binici, Weiming Wu, Tulika Mitra
TL;DR
This work tackles the capacity gap between teacher and student networks in knowledge distillation (KD) by introducing Generic Teacher Network (GTN), a one-off training procedure that yields a KD-aware generic teacher usable across a finite pool of student architectures. GTN grounds the teacher in the capacities of multiple reference students sampled from a weight-sharing supernet, using path sampling and a trainable distribution to expose the teacher to diverse student configurations without retraining it for each teacher-student pair. The method combines a conditioning loss and a dedicated architecture loss to regularize the teacher toward the combined capabilities of the student pool, followed by a standard KD phase that transfers knowledge to any target student. Empirical results on CIFAR-100 and ImageNet-200 show that GTN improves KD performance for both randomly sampled and NAS-derived students, at a constant overhead comparable to training a few specialized teachers, which enables scalable deployment across heterogeneous hardware platforms.
Abstract
Knowledge distillation (KD) is a model compression method in which a compact student model is trained to emulate the performance of a more complex teacher model. However, the architectural capacity gap between the two models limits the effectiveness of knowledge transfer. To address this issue, previous works have focused on customizing teacher-student pairs to improve compatibility, a computationally expensive process that must be repeated every time either model changes. These methods are therefore impractical when a teacher model has to be compressed into different student models for deployment on multiple hardware devices with distinct resource constraints. In this work, we propose Generic Teacher Network (GTN), a one-off KD-aware training procedure that creates a generic teacher capable of effectively transferring knowledge to any student model sampled from a given finite pool of architectures. To this end, we represent the student pool as a weight-sharing supernet and condition our generic teacher to align with the capacities of various student architectures sampled from this supernet. Experimental evaluation shows that our method both improves overall KD effectiveness and amortizes the minimal additional training cost of the generic teacher across the students in the pool.
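
For readers unfamiliar with KD, the following is a minimal sketch of the commonly used temperature-softened distillation objective, in which the student matches the teacher's softened class distribution. It is illustrative only: the function names, the temperature value, and the choice of a KL-divergence objective are assumptions made for this example, not the loss formulation of GTN, which is not specified in this section.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum_exp for s in scaled]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: this is the widely used soft-target KD objective;
    the temperature of 4.0 is a hypothetical default for illustration.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [0.0, 1.0, 3.0]
    print(kd_loss([3.0, 1.0, 0.0], teacher))  # mismatched student: positive loss
    print(kd_loss(teacher, teacher))          # identical logits: zero loss
```

The loss is zero when the student reproduces the teacher's logits (up to a constant shift) and grows as the two softened distributions diverge. A student with limited capacity may be unable to drive this loss down for an arbitrary teacher, which is the capacity gap that motivates conditioning the teacher on the student pool.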
