Sampling to Distill: Knowledge Transfer from Open-World Data

Yuzheng Wang, Zhaoyu Chen, Jie Zhang, Dingkang Yang, Zuhao Ge, Yang Liu, Siao Liu, Yunquan Sun, Wenqiang Zhang, Lizhe Qi

TL;DR

This work tackles data-free knowledge distillation by removing the reliance on data-generating modules and addressing domain shift through open-world data. It introduces Open-world Data Sampling Distillation (ODSD), which pairs Adaptive Prototype Sampling (APS) with Denoising Contrastive Relational Distillation (DCRD) to learn from structured relationships in unlabeled data while suppressing label noise. The method achieves state-of-the-art results on CIFAR-10/100, NYUv2, and ImageNet with lower FLOPs and fewer parameters than generation-based approaches, including up to a 9.59 percentage-point gain on ImageNet in cross-backbone scenarios. Practically, ODSD reduces computational waste by avoiding per-class generators and demonstrates robust knowledge transfer through both data-level and teacher-student relational signals, improving generalization in open-world settings.

Abstract

Data-Free Knowledge Distillation (DFKD) is a novel task that aims to train high-performance student models using only a pre-trained teacher network, without access to the original training data. Most existing DFKD methods rely heavily on additional generation modules to synthesize substitute data, which incurs high computational costs and ignores the massive amount of easily accessible, low-cost, unlabeled open-world data. Moreover, existing methods ignore the domain shift between the substitute data and the original data; as a result, knowledge from the teacher is not always trustworthy, and structured knowledge from the data becomes a crucial supplement. To tackle these issues, we propose a novel Open-world Data Sampling Distillation (ODSD) method for the DFKD task that avoids the redundant generation process. First, we sample open-world data close to the original data's distribution with an adaptive sampling module and introduce a low-noise representation to alleviate the domain shift. Then, we build structured relationships among multiple data examples to exploit data knowledge through the student model itself and the teacher's structured representation. Extensive experiments on CIFAR-10, CIFAR-100, NYUv2, and ImageNet show that ODSD achieves state-of-the-art performance with lower FLOPs and fewer parameters. In particular, we improve accuracy by 1.50%-9.59% on ImageNet and avoid training a separate generator for each class.
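
To make the idea of sampling open-world data with the help of the teacher more concrete, here is a minimal sketch in plain Python. It is not the paper's adaptive sampling module, whose details are not given in this summary; it is a simplified stand-in that assumes the teacher's softmax confidence is a usable proxy for closeness to the original distribution. The function name `sample_by_teacher_confidence` and the `per_class` parameter are hypothetical.

```python
import math
from collections import defaultdict


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_by_teacher_confidence(teacher_logits, per_class):
    """Pick the most confident unlabeled samples for each pseudo-class.

    Assumption (not from the paper): teacher confidence approximates how
    close an open-world sample is to the original training distribution.
    """
    buckets = defaultdict(list)
    for idx, logits in enumerate(teacher_logits):
        probs = softmax(logits)
        pseudo_label = max(range(len(probs)), key=probs.__getitem__)
        buckets[pseudo_label].append((probs[pseudo_label], idx))

    selected = []
    for label in sorted(buckets):
        # Rank samples of this pseudo-class by confidence, highest first.
        ranked = sorted(buckets[label], key=lambda t: t[0], reverse=True)
        selected.extend(idx for _, idx in ranked[:per_class])
    return sorted(selected)


# Toy teacher outputs for five unlabeled samples and three classes.
logits = [
    [4.0, 0.0, 0.0],  # confident, pseudo-label 0
    [0.6, 0.5, 0.4],  # ambiguous, pseudo-label 0
    [0.0, 3.0, 0.0],  # confident, pseudo-label 1
    [0.1, 0.3, 0.2],  # ambiguous, pseudo-label 1
    [0.0, 0.0, 5.0],  # confident, pseudo-label 2
]
print(sample_by_teacher_confidence(logits, per_class=1))  # [0, 2, 4]
```

Selecting a fixed number of samples per pseudo-class keeps the substitute set roughly class-balanced, which is one plausible way to avoid a subset dominated by a few easy classes.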

Paper Structure

This paper contains 14 sections, 8 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Comparison of (a) generation-based and (b) sampling-based methods. The sampling-based process utilizes the open-world unlabeled data to distill the student network, so it does not need additional generation costs. At the same time, the extra knowledge in these unlabeled data enriches the knowledge representation from the teacher.
  • Figure 2: The pipeline of our proposed ODSD. First, all open-world unlabeled data passes through adaptive prototype sampling so that the substitute dataset resembles the distribution of the original data. Then, based on these data, the student can make progress through low-noise information representation, data knowledge mining, and structured knowledge from the teacher.
  • Figure 3: Visualization of segmentation results on the NYUv2 dataset.
  • Figure 4: t-SNE visualization of the data distributions on CIFAR-100 and ImageNet datasets. Red dots denote original domain data, while blue dots denote unlabeled sampling data. The distance between dot groups reflects the similarity between data domains. The data sampled by our APS method is more similar to that of the original domain, effectively reducing domain noise and improving learning performance.
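
The "structured knowledge from the teacher" mentioned in the Figure 2 caption can be illustrated with a generic relational distillation loss. The sketch below is not the paper's DCRD objective, which is not specified in this summary; it assumes a simple formulation in which the student is penalized when its pairwise cosine-similarity matrix over a batch differs from the teacher's. All function names are hypothetical, and feature vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(features):
    """Pairwise cosine similarities between all samples in a batch."""
    return [[cosine(u, v) for v in features] for u in features]


def relational_loss(student_feats, teacher_feats):
    """Mean squared difference between student and teacher similarity matrices.

    Only relations between samples are compared, so the student and teacher
    feature dimensions do not need to match.
    """
    s = similarity_matrix(student_feats)
    t = similarity_matrix(teacher_feats)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
student_good = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # same relations, 2-D features
student_bad = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2]]   # all samples nearly collapsed

print(relational_loss(student_good, teacher))
print(relational_loss(student_bad, teacher))
```

Because the loss depends only on relations between samples, a student with a smaller feature dimension than the teacher can still be trained against it, which matters in cross-backbone settings.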