Hybrid Data-Free Knowledge Distillation

Jialiang Tang, Shuo Chen, Chen Gong

TL;DR

The paper addresses the practical challenge of knowledge distillation without access to large teacher-training datasets by proposing HiDFD, a hybrid data-free approach. It combines a teacher-guided GAN, which generates high-quality synthetic data from a small set of collected examples, with a data-inflated, classifier-sharing distillation strategy that trains a compact student on the resulting hybrid dataset. Key contributions include a feature-integrated teacher-guided generation module with category-frequency smoothing and a data-inflation-based distillation module that aligns student and teacher features via a shared classifier, significantly reducing the required collected data. The approach yields state-of-the-art results using only $5{,}000$ of $600{,}000$ collected examples ($1/120$) and demonstrates strong robustness across datasets and backbones, offering a practical path for deploying KD in data-sensitive domains.

Abstract

Data-free knowledge distillation aims to learn a compact student network from a pre-trained large teacher network without using the original training data of the teacher network. Existing collection-based and generation-based methods train student networks by collecting massive real examples and generating synthetic examples, respectively. However, they inevitably become weak in practical scenarios due to the difficulties in gathering or emulating sufficient real-world data. To solve this problem, we propose a novel method called Hybrid Data-Free Distillation (HiDFD), which leverages only a small amount of collected data as well as generates sufficient examples for training student networks. Our HiDFD comprises two primary modules, i.e., the teacher-guided generation and student distillation. The teacher-guided generation module guides a Generative Adversarial Network (GAN) by the teacher network to produce high-quality synthetic examples from very few real-world collected examples. Specifically, we design a feature integration mechanism to prevent the GAN from overfitting and facilitate the reliable representation learning from the teacher network. Meanwhile, we drive a category frequency smoothing technique via the teacher network to balance the generative training of each category. In the student distillation module, we explore a data inflation strategy to properly utilize a blend of real and synthetic data to train the student network via a classifier-sharing-based feature alignment technique. Intensive experiments across multiple benchmarks demonstrate that our HiDFD can achieve state-of-the-art performance using 120 times less collected data than existing methods. Code is available at https://github.com/tangjialiang97/HiDFD.
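
To make the data inflation idea in the abstract more concrete, here is a minimal sketch of how a small collected set might be blended with a larger synthetic set. It is not the authors' implementation. It assumes that inflation means repeating each collected example an integer number of times, and the names `build_hybrid_dataset` and `inflation_factor` are hypothetical. The example sizes are toy values chosen only for illustration.

```python
import random


def build_hybrid_dataset(collected, synthetic, inflation_factor, seed=0):
    """Blend collected and synthetic examples into one training list.

    Assumption (not confirmed by the paper): "inflation" is modeled here as
    repeating the collected set `inflation_factor` times so that it is not
    swamped by the much larger synthetic set.
    """
    if inflation_factor < 1:
        raise ValueError("inflation_factor must be >= 1")
    inflated = list(collected) * inflation_factor  # repeat the real examples
    hybrid = inflated + list(synthetic)            # append generated examples
    random.Random(seed).shuffle(hybrid)            # deterministic shuffle
    return hybrid


# Toy data: 5 "collected" examples and 20 "synthetic" ones.
collected = [("real", i) for i in range(5)]
synthetic = [("synthetic", i) for i in range(20)]

hybrid = build_hybrid_dataset(collected, synthetic, inflation_factor=4)
real_share = sum(1 for source, _ in hybrid if source == "real") / len(hybrid)
print(len(hybrid), real_share)
```

With these toy sizes, an inflation factor of 4 gives the collected examples the same weight in the hybrid set as the synthetic ones. In practice the factor would be a tuned hyperparameter.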

Paper Structure

This paper contains 15 sections, 14 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The diagram of (a) generation-based methods (fang2021contrastive; yin2020dreaming; chen2019data; micaelli2019zero), (b) collection-based methods (chen2021learning; tang2023distribution), and (c) our HiDFD. In HiDFD, the teacher-guided generation module employs the teacher network to guide the training of the GAN on limited collected data. Subsequently, the student distillation module closely aligns the features of the student network with those of the teacher network on the hybrid data comprising high-quality synthetic examples and properly inflated collected examples.
  • Figure 2: Parametric sensitivities of (a) $\lambda_{\mathrm{d}}$ and (b) $\lambda_{\mathrm{g}}$ in Eq. \ref{eq_total}. Accuracies (in %) of the student networks trained on collected data with (c) varying inflation factors and (d) varying quantities.