Teacher as a Lenient Expert: Teacher-Agnostic Data-Free Knowledge Distillation

Hyunjune Shin, Dong-Wan Choi

TL;DR

The paper addresses instability in data-free knowledge distillation (DFKD), which stems from sensitivity to the choice of teacher model and is hard to detect because no validation data is available. It introduces TA-DFKD, which removes the traditional class-prior constraint and employs a teacher-driven sample selection strategy that keeps only high-confidence generated samples, improving robustness across diverse teacher models. The approach combines a class-prior-free generator loss with adversarial and representation losses, and uses Gaussian Mixture Model-based sample selection and BN-statistics alignment to maintain sample quality and realism. Empirical results on CIFAR-10/100 and TinyImageNet show TA-DFKD achieving greater robustness and stability than previous DFKD methods across multiple teacher models. The work offers a practical data-free distillation framework whose performance is empirically insensitive to the teacher model, and suggests a promising direction for reliable knowledge transfer without access to real data.
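The teacher-driven selection step summarized above can be illustrated with a small sketch. This is not the authors' implementation: the summary does not state which score the Gaussian mixture is fitted on or how the cutoff is chosen, so the code assumes the score is the teacher's maximum softmax probability, fits a one-dimensional two-component mixture with plain EM, and keeps samples whose posterior for the higher-mean component exceeds a hypothetical threshold of 0.5. The names `teacher_confidence`, `fit_two_component_gmm`, and `select_clean` are illustrative only.

```python
import math


def teacher_confidence(logits):
    """Max softmax probability of one sample's teacher logits (assumed score)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return max(exps) / sum(exps)


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(values, iters=50):
    """Fit a 1-D two-component GMM with EM; returns (weights, means, variances)."""
    lo, hi = min(values), max(values)
    means = [lo, hi]  # start one component at each extreme
    spread = max((hi - lo) ** 2 / 4.0, 1e-6)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each value
        resp = []
        for v in values:
            dens = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a degenerate component
    return weights, means, variances


def select_clean(confidences, threshold=0.5):
    """Indices of samples assigned to the higher-confidence mixture component."""
    weights, means, variances = fit_two_component_gmm(confidences)
    clean = 0 if means[0] > means[1] else 1
    keep = []
    for i, c in enumerate(confidences):
        dens = [weights[k] * normal_pdf(c, means[k], variances[k]) for k in range(2)]
        total = sum(dens)
        if total > 0 and dens[clean] / total > threshold:
            keep.append(i)
    return keep


if __name__ == "__main__":
    # Hypothetical teacher logits for four generated samples (3 classes each)
    batch_logits = [[4.0, 0.1, 0.2], [0.5, 0.4, 0.6], [0.1, 5.0, 0.3], [0.9, 1.0, 1.1]]
    scores = [teacher_confidence(row) for row in batch_logits]
    print(select_clean(scores))
```

In a distillation loop, only the generated samples at the returned indices would be passed to the student update, while the generator stays free of any class-prior term. A one-dimensional mixture can misassign points in the far tails when the two components have very different variances, so the threshold and the score itself would need tuning in practice.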

Abstract

Data-free knowledge distillation (DFKD) aims to distill pretrained knowledge to a student model with the help of a generator without using original data. In such data-free scenarios, achieving stable performance of DFKD is essential due to the unavailability of validation data. Unfortunately, this paper has discovered that existing DFKD methods are quite sensitive to different teacher models, occasionally showing catastrophic failures of distillation, even when using well-trained teacher models. Our observation is that the generator in DFKD is not always guaranteed to produce precise yet diverse samples using the existing representative strategy of minimizing both class-prior and adversarial losses. Through our empirical study, we focus on the fact that class-prior not only decreases the diversity of generated samples, but also cannot completely address the problem of generating unexpectedly low-quality samples depending on teacher models. In this paper, we propose the teacher-agnostic data-free knowledge distillation (TA-DFKD) method, with the goal of more robust and stable performance regardless of teacher models. Our basic idea is to assign the teacher model a lenient expert role for evaluating samples, rather than a strict supervisor that enforces its class-prior on the generator. Specifically, we design a sample selection approach that takes only clean samples verified by the teacher model without imposing restrictions on the power of generating diverse samples. Through extensive experiments, we show that our method successfully achieves both robustness and training stability across various teacher models, while outperforming the existing DFKD methods.

Paper Structure (33 sections, 10 equations, 7 figures, 7 tables)

This paper contains 33 sections, 10 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: (a) Training curves of student models during distillation of two different well-trained teacher models on MNIST when using one or both of class-prior (DAFL) and adversarial learning (DFAD) losses. (b) Peak accuracies of student models distilled from five different teacher models on CIFAR-10 when using different DFKD methods, including our TA-DFKD method.
  • Figure 2: FID scores using a pretrained ResNet-34 model on CIFAR-10 with class-prior's intensity values from 0 to 1.
  • Figure 3: Images of Airplane (top) and Dog (bottom) generated by each trained version of the generator with or without class-prior for a pretrained ResNet-34 model on CIFAR-10.
  • Figure 4: (a) 2D visualization of feature vectors corresponding to real data and synthetic data generated by a generator trained using class-prior without the adversarial loss in ResNet-34 on CIFAR-10, where $\bullet$, $\star$, and $\times$ represent real data samples, high-quality synthetic samples within the boundary of their corresponding real data, and low-quality ones out of their boundary. (b) and (c) show a low-quality synthetic image and its probability distribution, respectively.
  • Figure 5: Overview of the proposed TA-DFKD method.
  • ...and 2 more figures