Hybrid Data-Free Knowledge Distillation
Jialiang Tang, Shuo Chen, Chen Gong
TL;DR
The paper addresses the practical challenge of knowledge distillation without access to the teacher's large training dataset by proposing HiDFD, a hybrid data-free approach. It combines a teacher-guided GAN, which generates high-quality synthetic data from a small set of collected examples, with a data-inflation, classifier-sharing distillation strategy that trains a compact student on the resulting hybrid dataset. Key contributions include a teacher-guided generation module with feature integration and category frequency smoothing, and a data-inflation-based distillation module that aligns student and teacher features via a shared classifier, significantly reducing the amount of collected data required. The approach yields state-of-the-art results using only $5{,}000$ collected examples where existing methods use $600{,}000$ (a ratio of $1/120$), and it is robust across datasets and backbones, offering a practical path for deploying KD in data-sensitive domains.
Abstract
Data-free knowledge distillation aims to learn a compact student network from a large pre-trained teacher network without using the teacher's original training data. Existing collection-based and generation-based methods train student networks by collecting massive numbers of real examples and by generating synthetic examples, respectively. However, both perform poorly in practical scenarios because sufficient real-world data is difficult to gather or emulate. To solve this problem, we propose a novel method called Hybrid Data-Free Distillation (HiDFD), which uses only a small amount of collected data and generates enough additional examples to train student networks. HiDFD comprises two primary modules: teacher-guided generation and student distillation. The teacher-guided generation module uses the teacher network to guide a Generative Adversarial Network (GAN) in producing high-quality synthetic examples from very few real-world collected examples. Specifically, we design a feature integration mechanism that prevents the GAN from overfitting and facilitates reliable representation learning from the teacher network. Meanwhile, we use the teacher network to drive a category frequency smoothing technique that balances the generative training of each category. In the student distillation module, we explore a data inflation strategy that properly utilizes a blend of real and synthetic data to train the student network via a classifier-sharing-based feature alignment technique. Extensive experiments across multiple benchmarks demonstrate that HiDFD achieves state-of-the-art performance using 120 times less collected data than existing methods. Code is available at https://github.com/tangjialiang97/HiDFD.
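To make the classifier-sharing idea above more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes that classifier sharing means the student reuses the teacher's frozen classifier on top of a trainable linear projector, and that feature alignment is a mean-squared error between projected student features and teacher features. All names, dimensions, and the random stand-in features are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: batch, student feature dim, teacher feature dim, classes.
BATCH, D_STUDENT, D_TEACHER, NUM_CLASSES = 8, 16, 32, 10

# Random stand-ins for backbone features on one batch of hybrid
# (collected + synthetic) data; real features would come from the networks.
student_feat = rng.normal(size=(BATCH, D_STUDENT))
teacher_feat = rng.normal(size=(BATCH, D_TEACHER))

# Teacher classifier, kept frozen and shared with the student (assumption).
W_cls = rng.normal(size=(D_TEACHER, NUM_CLASSES))
b_cls = np.zeros(NUM_CLASSES)

# Trainable linear projector mapping student features into teacher space.
W_proj = rng.normal(scale=0.1, size=(D_STUDENT, D_TEACHER))


def alignment_loss(projector):
    """Mean-squared error between projected student and teacher features."""
    diff = student_feat @ projector - teacher_feat
    return float(np.mean(diff ** 2))


def shared_classifier_logits(features):
    """Apply the frozen teacher classifier to teacher-space features."""
    return features @ W_cls + b_cls


initial_loss = alignment_loss(W_proj)

# Plain gradient descent on the projector only; the classifier never changes.
learning_rate = 0.5
for _ in range(200):
    diff = student_feat @ W_proj - teacher_feat
    grad = 2.0 * student_feat.T @ diff / diff.size
    W_proj -= learning_rate * grad

final_loss = alignment_loss(W_proj)

# The student predicts through its projector plus the teacher's classifier.
student_logits = shared_classifier_logits(student_feat @ W_proj)
```

In this sketch only the projector is updated, so any improvement in the student's predictions comes from moving its features closer to the teacher's feature space. The actual method may differ in the projector design, the loss, and which parameters are trained.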
