What is Dataset Distillation Learning?

William Yang, Ye Zhu, Zhiwei Deng, Olga Russakovsky

TL;DR

Dataset distillation aims to replace large datasets with compact synthetic data. This work investigates what information distilled data store, whether they can substitute real data, and how to interpret their content, using a combination of predictive, curvature, and influence-function analyses. It finds that distilled data are recognizable by real-data-trained models, reflect early training dynamics, and contain semantic information at the level of individual points, yet they are not faithful substitutes for real data and can be sensitive to training setup. These insights offer a framework for understanding and improving dataset distillation, with implications for efficiency and fairness in condensed-data regimes.

Abstract

Dataset distillation has emerged as a strategy to overcome the hurdles associated with large datasets by learning a compact set of synthetic data that retains essential information from the original dataset. While distilled data can be used to train high-performing models, little is understood about how the information is stored. In this study, we posit and answer three questions about the behavior, representativeness, and point-wise information content of distilled data. We reveal that distilled data cannot serve as a substitute for real data during training outside the standard evaluation setting for dataset distillation. Additionally, the distillation process retains high task performance by compressing information related to the early training dynamics of real models. Finally, we provide a framework for interpreting distilled data and reveal that individual distilled data points contain meaningful semantic information. This investigation sheds light on the intricate nature of distilled data, providing a better understanding of how they can be effectively utilized.

Paper Structure (37 sections, 2 equations, 26 figures, 2 tables)

Figures (26)

  • Figure 1: Real vs. distilled data. Real images of airplane, car, and truck from CIFAR-10 [krizhevsky2009learning] are shown on the left, and highly salient distilled images of the same classes are shown on the right. While distilled images can be used to train high-accuracy classifiers, why this is possible and what they represent remain unclear.
  • Figure 2: Pre-trained models recognize distilled data. Left: Classification accuracy of four different architectures (bar colors) trained on the real training dataset and evaluated on 100 images distilled using four different distillation algorithms (x-axis). These models successfully recognize distilled data (distribution matching and gradient matching do less well, but they are known to distill less information than the other two). Right: UMAP [mcinnes2018umap] visualization of real test images and distilled images using the penultimate features of a ResNet-18 [he2016deep] model trained on real data. Most of the distilled images lie on the class clusters (indicated by color), revealing that classification models interpret distilled images similarly to real images.
  • Figure 3: Distilled data is different from real data. Left: Kernel density estimation (KDE) plot of the pixel intensities of three sample images: a real image, an image distilled with trajectory matching, and an image distilled with distribution matching. Both distilled images contain pixel values outside [0,1]. Right: Accuracy of models trained on a mix of distilled and real data. We train models with 10 distilled images (from four different distillation algorithms; different color lines) combined with a random subset of 0-250 real images per class (x-axis). Adding real data samples to training does not substantially improve, and in some cases even decreases, the accuracy of the trained model. In stark contrast is the baseline (dashed line) trained on 10-260 random real images, which improves significantly with more real data.
  • Figure 4: Distribution of prediction agreement on CIFAR-10. Kernel density estimation plots of the number of CIFAR-10 examples on which a model trained on distilled data agrees with models trained on all of the real data but stopped early, or with models trained on a subset of the real data. The distribution reveals that, across all four distillation methods tested, early-stopped models have a considerably higher number of agreements, indicating that models trained on distilled data predict more like early-stopped models than like models trained on subsets of real data. This similarity suggests that training on distilled data is analogous to early stopping on real data.
  • Figure 5: Recognition performance on real and distilled data for a model trained on real data. We train a model on real data for 300 iterations and evaluate its accuracy at every iteration of training. The plot shows that classification accuracy on data distilled with BPTT, distribution matching, gradient matching, and trajectory matching stops improving after iteration 150, while classification accuracy on the real test set continues to improve. This plateau shows that the information the model needs to correctly classify the distilled data is learned only in the early iterations of training on real data, which suggests that distilled data store information about the early training dynamics of real data.
  • ...and 21 more figures
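
The prediction-agreement measure described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' code: the function name `prediction_agreement` and the toy label lists are hypothetical, and the sketch assumes agreement is simply the count of examples on which two models predict the same class.

```python
def prediction_agreement(preds_a, preds_b):
    """Count examples on which two models predict the same class.

    Assumption: agreement is an exact match of predicted labels,
    counted over the same ordered set of evaluation examples.
    """
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    return sum(a == b for a, b in zip(preds_a, preds_b))


# Hypothetical predicted labels for six evaluation examples.
distilled_model = [0, 1, 2, 2, 1, 0]      # model trained on distilled data
early_stopped_model = [0, 1, 2, 1, 1, 0]  # real data, stopped early
subset_model = [0, 2, 2, 1, 0, 0]         # subset of real data

agree_early = prediction_agreement(distilled_model, early_stopped_model)
agree_subset = prediction_agreement(distilled_model, subset_model)
print(agree_early, agree_subset)
```

In this toy example the model trained on distilled data agrees more often with the early-stopped model than with the subset-trained model, which is the kind of comparison the caption summarizes over many trained models.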