Dataset Distillation for Pre-Trained Self-Supervised Vision Models

George Cazenavette, Antonio Torralba, Vincent Sitzmann

TL;DR

The paper tackles dataset distillation for pre-trained self-supervised vision models, proposing Linear Gradient Matching to synthesize tiny datasets whose gradients on a linear head mirror those produced by real data. The method optimizes a meta-loss $\mathcal{L}_\text{meta} = 1 - \cos\left( \text{vec}\left( \frac{\partial \ell_\text{real}}{\partial W} \right), \text{vec}\left( \frac{\partial \ell_\text{syn}}{\partial W} \right) \right)$, the cosine distance between the real and synthetic gradients of the linear classifier $W$, and is enhanced with a pyramid representation, color decorrelation, and differentiable augmentations. Empirically, a single image per class distilled with various backbones (e.g., DINO-v2, CLIP, EVA-02, MoCo-v3) matches or exceeds real-image baselines, with strong cross-model transfer and notable gains in fine-grained and out-of-distribution scenarios. The work highlights the potential of distilled data as a tool for model interpretability and alignment diagnostics, and as a way to substantially reduce training time and resources, while providing practical code and datasets for further investigation.

Abstract

The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models' embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.
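
The matching objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the classification loss is a mean softmax cross-entropy, uses random arrays as stand-ins for the features that the pre-trained extractor would produce, and the names `probe_grad` and `meta_loss` are hypothetical. It only evaluates the cosine-distance objective; updating the synthetic images would additionally require back-propagating through the feature extractor with an autodiff framework.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def probe_grad(W, feats, labels):
    # Closed-form gradient of mean softmax cross-entropy w.r.t. a linear
    # probe W of shape (num_classes, dim); feats has shape (batch, dim).
    probs = softmax(feats @ W.T)
    onehot = np.eye(W.shape[0])[labels]
    return (probs - onehot).T @ feats / len(labels)

def meta_loss(W, feats_real, y_real, feats_syn, y_syn, eps=1e-12):
    # Cosine distance between the flattened real and synthetic gradients.
    g_real = probe_grad(W, feats_real, y_real).ravel()
    g_syn = probe_grad(W, feats_syn, y_syn).ravel()
    denom = np.linalg.norm(g_real) * np.linalg.norm(g_syn) + eps
    return 1.0 - (g_real @ g_syn) / denom

# Assumption: random features stand in for extractor outputs.
rng = np.random.default_rng(0)
num_classes, dim = 3, 8
W = rng.normal(size=(num_classes, dim))          # randomly initialized probe
feats_real = rng.normal(size=(30, dim))          # a batch of "real" features
y_real = rng.integers(0, num_classes, size=30)
feats_syn = rng.normal(size=(num_classes, dim))  # one synthetic sample per class
y_syn = np.arange(num_classes)

loss = meta_loss(W, feats_real, y_real, feats_syn, y_syn)
print(f"meta loss: {loss:.4f}")
```

The loss lies between 0 (gradients point in the same direction) and 2 (opposite directions), so two identical batches give a value of approximately 0.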

Paper Structure

This paper contains 34 sections, 4 equations, 28 figures, 7 tables.

Figures (28)

  • Figure 1: ImageNet-1k Distilled for Self-Supervised Models: Using our method of Linear Gradient Matching, we distill vision datasets to just one synthetic image per class using different pre-trained self-supervised backbone models. These learned images can then be used to train linear probes that achieve high accuracy on unseen test data, outperforming all real-image baselines. Furthermore, each backbone model seems to yield its own "style" of distilled image, giving insights into the aspects on which these models tend to focus (structure, texture, color, etc.).
  • Figure 2: Linear Gradient Matching for Pre-Trained Vision Models: Given a pre-trained self-supervised vision model ($\phi$), we perform a distillation step by first passing a batch of real and synthetic data through $\phi$ and a randomly-initialized linear classifier ($W$) to get the real and synthetic classification losses ($\ell_\text{real}$ and $\ell_\text{syn}$). Our meta loss ($\mathcal{L}_\text{meta}$) is then defined as the cosine distance between the gradients of these classification losses ($\ell_\text{real}$ and $\ell_\text{syn}$) with respect to the random linear probe ($W$). This meta loss is then back-propagated through the initial synthetic gradient calculation and used to update our synthetic images. This technique allows us to distill large datasets to just a single image per class while still achieving high performance when training new linear probes.
  • Figure 3: Performing more rounds of differentiable augmentation on the synthetic data during each distillation step improves both the single-model and cross-model performance of the distilled images.
  • Figure 3: Evaluating Ablations: While all three components provide improvements, the Augmentation has the most dramatic effect, especially in the cross-model setting. Likewise, the Pyramid optimization seems to matter more in the cross-model setting than the same-model setting by mitigating overfitting to the model used during distillation.
  • Figure 4: We distill ImageNet-Fruits and observe the PCA of the training image embeddings. Each color represents a class. Note that the distilled images typically lie on the edge or outside of their class's cluster.
  • ...and 23 more figures
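
As a worked note on the gradients in Figure 2: the caption says only "classification losses", so assume here that $\ell$ is a mean softmax cross-entropy over a batch of $B$ samples. With features $f_i = \phi(x_i)$, one-hot labels $y_i$, and predictions $p_i = \text{softmax}(W f_i)$ (the symbols $B$, $f_i$, $y_i$, and $p_i$ are introduced for illustration), the gradient with respect to the linear probe has the closed form

$$\frac{\partial \ell}{\partial W} = \frac{1}{B} \sum_{i=1}^{B} \left( p_i - y_i \right) f_i^\top$$

Under this assumption, $\ell_\text{syn}$ depends on the synthetic images only through the features $\phi(x_i)$, so back-propagating the meta loss through this expression and then through $\phi$ reaches the synthetic pixels.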