Self-Supervised Dataset Distillation for Transfer Learning

Dong Bok Lee, Seanie Lee, Joonho Ko, Kenji Kawaguchi, Juho Lee, Sung Ju Hwang

TL;DR

This work tackles unsupervised dataset distillation for transfer learning: it compresses an unlabeled dataset into a small synthetic set that can be used for self-supervised pre-training. It shows that the gradient of a naive SSL objective in bilevel optimization is biased because of the randomness in data augmentations, and replaces the inner and outer objectives with deterministic MSE-based formulations, using a pool of feature extractors and kernel ridge regression to keep the computation tractable. The proposed method, Kernel Ridge Regression on Self-supervised Target (KRR-ST), improves over supervised dataset distillation baselines in transfer learning, architecture generalization, and targeted data-free knowledge distillation (KD). The approach enables efficient SSL pre-training with cross-domain transfer at a practical computational cost, which makes it attractive for neural architecture search (NAS), continual learning, and privacy-preserving KD.

Abstract

Dataset distillation methods have achieved remarkable success in distilling a large dataset into a small set of representative samples. However, they are not designed to produce a distilled dataset that can be effectively used for facilitating self-supervised pre-training. To this end, we propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL). We first prove that a gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to the randomness originating from data augmentations or masking. To address this issue, we propose to minimize the mean squared error (MSE) between a model's representations of the synthetic examples and their corresponding learnable target feature representations for the inner objective, which does not introduce any randomness. Our primary motivation is that the model obtained by the proposed inner optimization can mimic the self-supervised target model. To achieve this, we also introduce the MSE between representations of the inner model and the self-supervised target model on the original full dataset for outer optimization. Lastly, assuming that a feature extractor is fixed, we only optimize a linear head on top of the feature extractor, which allows us to reduce the computational cost and obtain a closed-form solution of the head with kernel ridge regression. We empirically validate the effectiveness of our method on various applications involving transfer learning.
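
The closed-form head and the outer MSE described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the toy feature vectors, and the ridge strength `lam` are all hypothetical, the fixed feature extractor is replaced by hand-written feature lists, and a linear kernel on those features is assumed because the abstract speaks of a linear head. A real pipeline would also differentiate the outer loss with respect to the synthetic examples and their targets using an autodiff library, which is omitted here.

```python
def linear_kernel(A, B):
    """Gram matrix K[i][j] = <A[i], B[j]> for lists of feature vectors."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]

def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve(M, Y):
    """Solve M @ W = Y by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(Y[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def krr_predict(feat_s, targets_s, feat_t, lam):
    """Kernel ridge regression head: K_ts (K_ss + lam * I)^(-1) Y_s."""
    K_ss = linear_kernel(feat_s, feat_s)
    for i in range(len(K_ss)):
        K_ss[i][i] += lam  # ridge term on the diagonal
    alpha = solve(K_ss, targets_s)
    return matmul(linear_kernel(feat_t, feat_s), alpha)

def mse(pred, target):
    """Mean squared error over all entries of two equally shaped matrices."""
    count = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for rp, rt in zip(pred, target)
               for p, t in zip(rp, rt)) / count

# Toy data (assumed): features of 2 synthetic examples, their learnable
# target representations, and features of 1 real example.
feat_s = [[1.0, 0.0], [0.0, 1.0]]
targets_s = [[1.0, 0.0], [0.0, 2.0]]
feat_t = [[1.0, 1.0]]
target_model_repr = [[0.5, 0.0]]  # stand-in for the self-supervised target model

pred = krr_predict(feat_s, targets_s, feat_t, lam=1.0)
loss = mse(pred, target_model_repr)
print(pred, loss)  # [[0.5, 1.0]] 0.5
```

Because the head is obtained in closed form, nothing in this computation depends on a random augmentation draw, which is the property the deterministic MSE objectives are meant to provide.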

Paper Structure

This paper contains 32 sections, 1 theorem, 10 equations, 7 figures, 12 tables, and 1 algorithm.

Key Result

Theorem 1

The derivative $\frac{\partial \mathcal{L}_\text{SSL}(\hat{\theta}(X_s); X_t)}{\partial X_s}$ is a biased estimator of $\frac{\partial \mathcal{L}_\text{SSL}(\theta^*(X_s); X_t)}{\partial X_s}$, i.e., $\mathbb{E}_\zeta\left[\frac{\partial \mathcal{L}_\text{SSL}(\hat{\theta}(X_s); X_t)}{\partial X_s}\right] \neq \frac{\partial \mathcal{L}_\text{SSL}(\theta^*(X_s); X_t)}{\partial X_s}$.
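
A toy scalar example can give some intuition for why a gradient taken through a randomly perturbed inner solution may be biased. It is not the paper's proof and does not use an SSL objective: the inner solution `theta_hat`, the noise values, and the quartic outer loss are all hypothetical choices made only so the expectation can be enumerated exactly.

```python
from fractions import Fraction

def theta_hat(x_s, zeta):
    """Hypothetical noisy inner solution: the random draw zeta shifts it."""
    return x_s + zeta

def theta_star(x_s):
    """Noise-free inner solution: minimizer of E_zeta[(theta - (x_s + zeta))^2]."""
    return x_s  # zeta has zero mean, so the minimizer is x_s itself

def outer_grad(theta):
    """d/dx_s of the toy outer loss L(theta) = theta**4 (d theta / d x_s = 1)."""
    return 4 * theta ** 3

x_s = 2
zetas = [-1, 1]  # two equally likely noise draws with mean zero

# Expectation of the gradient computed through the noisy inner solution.
expected_noisy_grad = Fraction(
    sum(outer_grad(theta_hat(x_s, z)) for z in zetas), len(zetas)
)
# Gradient computed through the noise-free inner solution.
exact_grad = outer_grad(theta_star(x_s))

print(expected_noisy_grad, exact_grad)  # 56 32
```

Although the noise has zero mean, the averaged gradient (56) differs from the gradient at the noise-free solution (32) because the outer gradient is nonlinear in the inner solution. An objective with no randomness in the inner problem avoids this gap by construction.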

Figures (7)

  • Figure 1: (a) Previous supervised dataset distillation methods. (b) Our proposed method, which distills an unlabeled dataset into a small set that can be effectively used for pre-training and transfer to target datasets.
  • Figure 2: Visualization of the distilled images, their feature representations, and the corresponding distilled labels in the output space of the target model. All distilled images are provided in the appendix.
  • Figure 3: Results of architecture generalization. ConvNet4 is used to condense TinyImageNet into 2,000 synthetic examples. Models with different architectures are pre-trained on the condensed dataset and fine-tuned on target datasets. We report the average and standard deviation over three runs. The same results are reported in tabular form in the appendix.
  • Figure 4: Visualization of the synthetic images distilled by our method in CIFAR100.
  • Figure 5: Visualization of the synthetic images distilled by our method in TinyImageNet.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Theorem 1