What Makes a Good Dataset for Knowledge Distillation?

Logan Frank, Jim Davis

TL;DR

The paper tackles knowledge distillation (KD) when the teacher's training data is unavailable, a common situation in continual learning and when distilling models trained on proprietary data. Under a standard KD framework, it systematically evaluates diverse surrogate datasets: real ID and OOD data, and unoptimized synthetic data such as OpenGL shader images, Leaves, and noise. The authors identify core criteria for effective KD data (balanced class coverage, diverse imagery, and rich decision-boundary information) and introduce an adversarial perturbation strategy to further improve knowledge transfer. The findings indicate that surrogate data can match or nearly match original-data performance, and they provide practical guidance for data-restricted KD.
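One way to make the "balanced class coverage" criterion concrete is to look at which classes the teacher predicts on a candidate surrogate dataset. The sketch below is illustrative only and is not taken from the paper's source code: the function name `class_coverage` and the use of normalized entropy as a balance score are assumptions made here, and the teacher's argmax predictions are assumed to be available as a list of integer class indices.

```python
import math
from collections import Counter


def class_coverage(teacher_predictions, num_classes):
    """Summarize how teacher predictions spread over the classes.

    Hypothetical helper (not from the paper). Returns a pair:
      covered: fraction of classes predicted at least once
      balance: normalized entropy of the prediction histogram,
               1.0 for a uniform spread and 0.0 for a single class
    """
    counts = Counter(teacher_predictions)
    n = len(teacher_predictions)

    # Fraction of classes that appear at least once.
    present = [c for c in range(num_classes) if counts.get(c, 0) > 0]
    covered = len(present) / num_classes

    # Entropy of the empirical class distribution, divided by its maximum.
    probs = [counts[c] / n for c in present]
    entropy = -sum(p * math.log(p) for p in probs)
    balance = entropy / math.log(num_classes) if num_classes > 1 else 1.0
    return covered, balance


if __name__ == "__main__":
    # Toy predictions for a 4-class teacher: class 3 is never predicted.
    print(class_coverage([0, 0, 1, 2, 0, 1], num_classes=4))
```

A surrogate dataset scoring low on either value would leave some classes rarely or never exercised during distillation, which is the situation the balance criterion is meant to avoid.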

Abstract

Knowledge distillation (KD) has been a popular and effective method for model compression. One important assumption of KD is that the teacher's original dataset will also be available when training the student. However, in situations such as continual learning and distilling large models trained on company-withheld datasets, having access to the original data may not always be possible. This leads practitioners towards utilizing other sources of supplemental data, which could yield mixed results. One must then ask: "what makes a good dataset for transferring knowledge from teacher to student?" Many would assume that only real in-domain imagery is viable, but is that the only option? In this work, we explore multiple possible surrogate distillation datasets and demonstrate that many different datasets, even unnatural synthetic imagery, can serve as a suitable alternative in KD. From examining these alternative datasets, we identify and present various criteria describing what makes a good dataset for distillation. Source code is available at https://github.com/osu-cvl/good-kd-dataset.
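Because surrogate images carry no ground-truth labels for the teacher's task, distillation in this setting can rely on the teacher's outputs alone. The following minimal sketch assumes the commonly used temperature-softened KL-divergence objective; the abstract does not state the exact loss, so the formulation, the function names `softmax` and `kd_loss`, and the default temperature are assumptions for illustration rather than the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by T^2.

    Assumed objective (not confirmed by the source). No label is
    required, so any input image can serve as a distillation sample.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    print(kd_loss(teacher, teacher))          # identical outputs give zero loss
    print(kd_loss(teacher, [0.1, 1.0, 2.0]))  # mismatched outputs give a positive loss
```

Since the loss depends only on the two models' outputs for the same input, the choice of input images determines which parts of the teacher's function the student gets to observe.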

Paper Structure

This paper contains 11 sections, 2 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Example OpenGL shader (left column), Leaves (middle column), and noise (right column) images.
  • Figure 2: Visualization of 2D GAP features comparing an MNIST teacher with students trained on various distillation datasets.
  • Figure 3: Decision boundary exploitation adversarial attack in the teacher feature space for an arbitrary class $C$ (left to right). The symbols $\times$, $\blacksquare$, $\blacktriangle$, and $\CIRCLE$ represent the original synthetic examples, "post"-success examples, "pre"-success examples, and "deeper" examples, respectively. Best viewed in color.