Small Scale Data-Free Knowledge Distillation

He Liu, Yikai Wang, Huaping Liu, Fuchun Sun, Anbang Yao

TL;DR

The paper tackles the inefficiency of data-free knowledge distillation (D-KD), in which a large teacher model guides a smaller student using synthetic data instead of the original training set. It introduces Small Scale Data-free KD (SSD-KD), which distills from a small set of inverted (synthetic) samples. Two components curate that set: a diversity- and difficulty-aware modulating function $\phi(x)$, and a priority sampling function $\delta$ driven by reinforcement learning and backed by a dynamic replay buffer. On CIFAR-10/100 and NYUv2, SSD-KD makes end-to-end training one to two orders of magnitude faster while delivering competitive or improved student performance, even when the synthetic set is only 10% of the original training data scale. The framework makes privacy-preserving D-KD practical in resource-constrained settings and offers a principled way to balance sample diversity and difficulty during both data inversion and distillation.

Abstract

Data-free knowledge distillation can leverage the knowledge learned by a large teacher network to augment the training of a smaller student network without accessing the original training data, avoiding privacy, security, and proprietary risks in real applications. In this line of research, existing methods typically follow an inversion-and-distillation paradigm in which a generative adversarial network, trained on the fly under the guidance of the pre-trained teacher network, synthesizes a large-scale sample set for knowledge distillation. In this paper, we reexamine this common data-free knowledge distillation paradigm, showing that there is considerable room to improve the overall training efficiency through the lens of "small-scale inverted data for knowledge distillation". In light of three empirical observations indicating the importance of balancing class distributions in terms of synthetic sample diversity and difficulty during both the data inversion and distillation processes, we propose Small Scale Data-free Knowledge Distillation (SSD-KD). In formulation, SSD-KD introduces a modulating function to balance synthetic samples and a priority sampling function to select proper samples, facilitated by a dynamic replay buffer and a reinforcement learning strategy. As a result, SSD-KD can perform distillation training conditioned on an extremely small scale of synthetic samples (e.g., 10× less than the original training data scale), making the overall training one or two orders of magnitude faster than many mainstream methods while retaining superior or competitive model performance, as demonstrated on popular image classification and semantic segmentation benchmarks. The code is available at https://github.com/OSVAI/SSD-KD.
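
To make the idea of priority sampling from a replay buffer more concrete, the following is a minimal sketch and not the authors' implementation. The class `PriorityReplayBuffer`, the helper `toy_priority`, the lowest-priority eviction rule, and the use of teacher-student disagreement as a difficulty proxy are all assumptions chosen for illustration; the abstract does not specify how SSD-KD defines its priority or updates its buffer.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PriorityReplayBuffer:
    """Fixed-capacity buffer that samples items in proportion to a priority score."""

    capacity: int
    samples: list = field(default_factory=list)
    priorities: list = field(default_factory=list)

    def add(self, sample, priority: float) -> None:
        # While there is room, simply append the new sample.
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
            self.priorities.append(priority)
            return
        # When full, overwrite the lowest-priority entry (an assumed eviction rule).
        idx = min(range(len(self.priorities)), key=self.priorities.__getitem__)
        self.samples[idx] = sample
        self.priorities[idx] = priority

    def sample(self, k: int, rng: random.Random) -> list:
        # Draw k items with replacement, weighted by their priorities.
        return rng.choices(self.samples, weights=self.priorities, k=k)


def toy_priority(teacher_probs, student_probs) -> float:
    # Hypothetical difficulty proxy: total variation distance between the
    # teacher's and the student's predicted class distributions, plus a small
    # floor so that every stored sample remains reachable.
    tv = 0.5 * sum(abs(t - s) for t, s in zip(teacher_probs, student_probs))
    return tv + 1e-3
```

In this sketch, samples on which the student disagrees most with the teacher are drawn more often, which is one simple way to bias a small synthetic set toward harder examples; balancing diversity across classes would need an additional term that is not shown here.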

Paper Structure (13 sections, 6 equations, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: Comparison of knowledge distillation (KD) using original samples vs. synthetic samples, under the same training data scale: 5000 samples (10% of the CIFAR-10 training dataset size). In such a small-scale KD regime, the student models trained on synthetic samples (DeepInv [deepversionyin2020dreaming] and SSD-KD) consistently show much better accuracy than their counterparts trained on original samples (Vanilla KD [kd_hinton]).
  • Figure 2: Comparison of synthetic sample distributions collected from two top-performing adversarial D-KD methods (DeepInv [deepversionyin2020dreaming] and Fast10 [fasterfang]) and our SSD-KD on the CIFAR-100 dataset. The difficulty panel shows that our method better balances the difficulty distribution of synthetic samples while encouraging the generator to invert more hard samples, and the diversity panel further shows that our method better balances the diversity distribution of synthetic samples across categories.
  • Figure 3: Comparison of optimization pipelines for existing adversarial D-KD methods, including both the conventional family [chen2019data, dfqakdchoi2020data, zskt2019, deepversionyin2020dreaming] (left) and the more efficient family [cmifang2021contrastive, fasterfang] (middle), and our SSD-KD (right). SSD-KD formulates a reinforcement learning strategy that flexibly seeks appropriate synthetic samples to update a portion of the existing samples in a dynamic replay buffer, explicitly measuring their priorities so as to jointly balance the sample diversity and difficulty distributions. See the method section for notation definitions.
  • Figure 4: Performance comparison of SSD-KD under different synthetic data scales, expressed relative to the original training dataset size, in terms of top-1 classification accuracy (%) and overall training time cost (hours).
  • Figure 5: Visualization examples of synthetic image samples generated by Fast10 and our SSD-KD for the NYUv2 dataset.