A Label is Worth a Thousand Images in Dataset Distillation

Tian Qin; Zhiwei Deng; David Alvarez-Melis

TL;DR

Data quality, not just quantity, governs machine learning performance, and dataset distillation seeks to compress a target dataset $\mathcal{D}_{target}$ into a much smaller $\mathcal{D}_{syn}$ while preserving downstream accuracy. Surprisingly, the authors find that the key factor is soft labels rather than synthetic images, with structured information in those labels driving data-efficient learning. They show a simple soft-label baseline using randomly sampled images and pretrained experts that matches ensemble-based distillation across ImageNet-1K and smaller datasets, and they reveal that expert knowledge can be traded for data via a data-knowledge scaling law and a Pareto frontier. They also demonstrate that soft-label quality can be enhanced by expert ensembles or learned via distillation methods like truncated-BPTT, which can reproduce ensemble-like labels without explicit experts. Overall, the work challenges conventional distillation strategies and identifies soft-label design as a central lever for improving data-efficient learning, with implications for future data-centric methods and KD-like techniques across domains.

Abstract

Data $\textit{quality}$ is a crucial factor in the performance of machine learning models, a principle that dataset distillation methods exploit by compressing training datasets into much smaller counterparts that maintain similar downstream performance. Understanding how and why data distillation methods work is vital not only for improving these methods but also for revealing fundamental characteristics of "good" training data. However, a major challenge in achieving this goal is the observation that distillation approaches, which rely on sophisticated but mostly disparate methods to generate synthetic data, have little in common with each other. In this work, we highlight a largely overlooked aspect common to most of these methods: the use of soft (probabilistic) labels. Through a series of ablation experiments, we study the role of soft labels in depth. Our results reveal that the main factor explaining the performance of state-of-the-art distillation methods is not the specific techniques used to generate synthetic data but rather the use of soft labels. Furthermore, we demonstrate that not all soft labels are created equal; they must contain $\textit{structured information}$ to be beneficial. We also provide empirical scaling laws that characterize the effectiveness of soft labels as a function of images-per-class in the distilled dataset and establish an empirical Pareto frontier for data-efficient learning. Combined, our findings challenge conventional wisdom in dataset distillation, underscore the importance of soft labels in learning, and suggest new directions for improving distillation methods. Code for all experiments is available at https://github.com/sunnytqin/no-distillation.
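The soft-label baseline described above (randomly sampled real images labeled by a pretrained expert) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `expert_logits_fn`, `ipc`, and `temperature` are hypothetical names, the dataset is assumed to be a list of `(image, hard_label)` pairs, and the soft cross-entropy loss is an assumed choice for training the student.

```python
import math
import random
from collections import defaultdict


def softmax_with_temperature(logits, temperature=1.0):
    """Turn expert logits into a soft (probabilistic) label."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_baseline(dataset, expert_logits_fn, ipc, temperature=1.0, seed=0):
    """Sample up to `ipc` real images per class and attach expert soft labels.

    dataset: list of (image, hard_label) pairs (assumed layout).
    expert_logits_fn: hypothetical callable mapping an image to a list of logits.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append(image)

    distilled = []
    for label in sorted(by_class):
        images = by_class[label]
        # Random selection replaces any synthetic-image optimization.
        for image in rng.sample(images, min(ipc, len(images))):
            probs = softmax_with_temperature(expert_logits_fn(image), temperature)
            distilled.append((image, probs))
    return distilled


def soft_cross_entropy(student_probs, soft_label, eps=1e-12):
    """Cross-entropy of student predictions against a soft label (assumed loss)."""
    return -sum(q * math.log(p + eps) for p, q in zip(student_probs, soft_label))
```

In this sketch the images are never modified; all of the expert's knowledge reaches the student through the probability vectors, which is the setting the abstract's ablations isolate.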

Paper Structure (41 sections, 3 equations, 12 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Soft Labels are Crucial for Distillation
Background on dataset distillation and a simple soft label baseline
Benchmarking distillation methods against the soft label baseline
Good Soft Labels Require Structured Information
General observations from the soft label baseline
Structured information in soft labels and its importance for distillation
Label swapping test.
Effect of temperature and epoch in soft labels.
Trading data with knowledge
Scaling law.
Learning from (almost) no data.
Soft Labels are Not Created Equal
Expert ensemble
...and 26 more sections

Figures (12)

Figure 1: Soft labels are crucial for dataset distillation. Left: synthetic images produced by different distillation methods. Right: student test accuracy of different distillation methods compared with a random baseline drawn from the training data (with either hard or soft labels). For soft/hard label generation details, see Appendix \ref{appdx:label_generation}.
Figure 2: Expert test accuracy vs. student test accuracy vs. soft label entropy. The quality of soft labels (measured by student accuracy) depends on expert accuracy (left) and label entropy (right).
Figure 3: Importance of the $i$-th label, measured by a label swapping test that swaps the $i$-th label (sorted by softmax value) with the last label. Top labels contain structured information, while non-top labels contain noise.
Figure 4: (Expert Epoch, Temp) grid search on TinyImageNet IPC=10. Temperature smoothing does not fully resolve the issue that later-epoch experts yield sub-optimal labels for a given data budget.
Figure 5: Soft labels generated by an expert at different epochs. The structured information in soft labels changes over the course of training.
...and 7 more figures
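
The label swapping test in the Figure 3 caption and the label entropy in the Figure 2 caption can be illustrated with a short sketch. It assumes a soft label is a plain list of class probabilities and that $i$ is a 1-indexed rank (so $i=1$ is the top class); the function names are hypothetical and not taken from the paper's code.

```python
import math


def swap_ith_with_last(soft_label, i):
    """Swap the probability of the i-th ranked class with the lowest-ranked one.

    Ranks are 1-indexed and sorted by descending probability (an assumption).
    """
    order = sorted(range(len(soft_label)), key=lambda k: soft_label[k], reverse=True)
    a, b = order[i - 1], order[-1]
    swapped = list(soft_label)
    swapped[a], swapped[b] = swapped[b], swapped[a]
    return swapped


def entropy(soft_label):
    """Shannon entropy (in nats) of a soft label."""
    return -sum(p * math.log(p) for p in soft_label if p > 0)


label = [0.6, 0.25, 0.1, 0.05]
swapped = swap_ith_with_last(label, 1)  # top class trades places with the last
```

A swap only permutes the probabilities, so the entropy of the label stays the same; what changes is which class carries each probability mass. That makes the test a probe of the structure in the ranking rather than of how smooth the label is.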
