Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching

Ziyao Guo, Kai Wang, George Cazenavette, Hui Li, Kaipeng Zhang, Yang You

TL;DR

This work proposes to align the difficulty of the generated patterns with the size of the synthetic dataset, and successfully scales trajectory matching-based methods to larger synthetic datasets, achieving lossless dataset distillation for the first time.

Abstract

The ultimate goal of Dataset Distillation is to synthesize a small synthetic dataset such that a model trained on this synthetic set performs as well as a model trained on the full, real dataset. Until now, no Dataset Distillation method has reached this completely lossless goal, in part because previous methods remain effective only when the total number of synthetic samples is extremely small. Since only so much information can be contained in such a small number of samples, it seems that to achieve truly lossless dataset distillation, we must develop a distillation method that remains effective as the size of the synthetic dataset grows. In this work, we present such an algorithm and elucidate why existing methods fail to generate larger, high-quality synthetic sets. Current state-of-the-art methods rely on trajectory matching, that is, optimizing the synthetic data to induce long-term training dynamics similar to those of the real data. We empirically find that the training stage of the trajectories we choose to match (i.e., early or late) greatly affects the effectiveness of the distilled dataset. Specifically, early trajectories (where the teacher network learns easy patterns) work well for a low-cardinality synthetic set since there are fewer examples across which to distribute the necessary information. Conversely, late trajectories (where the teacher network learns hard patterns) provide better signals for larger synthetic sets since there are now enough samples to represent the necessary complex patterns. Based on our findings, we propose to align the difficulty of the generated patterns with the size of the synthetic dataset. In doing so, we successfully scale trajectory matching-based methods to larger synthetic datasets, achieving lossless dataset distillation for the very first time. Code and distilled datasets are available at https://gzyaftermath.github.io/DATM.
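To make the idea of matching a chosen stage of the expert trajectory concrete, here is a minimal sketch. It is not the authors' implementation. It assumes the normalized squared-distance objective commonly used in trajectory matching (the abstract does not state the exact form), treats model parameters as flat lists of floats, and uses hypothetical names (`matching_loss`, `sample_start_epoch`). The epoch windows in the usage lines are illustrative values only.

```python
import random
from typing import Sequence


def matching_loss(student_end: Sequence[float],
                  expert_start: Sequence[float],
                  expert_target: Sequence[float]) -> float:
    """Normalized squared distance between student and expert parameters.

    Assumed form (not given in the abstract):
        ||student_end - expert_target||^2 / ||expert_start - expert_target||^2
    The numerator measures how far the student, trained on synthetic data
    starting from expert_start, ends up from a later expert checkpoint.
    The denominator is how far the expert itself moved over that segment.
    """
    num = sum((s - e) ** 2 for s, e in zip(student_end, expert_target))
    den = sum((a - b) ** 2 for a, b in zip(expert_start, expert_target))
    return num / den


def sample_start_epoch(t_min: int, t_max: int, rng: random.Random) -> int:
    """Pick the expert checkpoint to start from, uniformly in [t_min, t_max].

    Restricting this window is how a matching stage (early or late) is chosen.
    """
    return rng.randint(t_min, t_max)


# Hypothetical windows: an early one for a small synthetic set,
# a late one for a large synthetic set.
rng = random.Random(0)
early_start = sample_start_epoch(0, 20, rng)
late_start = sample_start_epoch(20, 40, rng)

# A student that did not move from the expert's starting point has loss 1.0;
# a student that lands exactly on the expert's target has loss 0.0.
no_progress = matching_loss([0.0, 0.0], [0.0, 0.0], [1.0, 1.0])
perfect = matching_loss([1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
```

In a real distillation loop the student parameters would come from a few differentiable training steps on the synthetic images, and the loss would be backpropagated to those images. The sketch omits that step and only shows how the sampling window and the matching objective fit together.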

Paper Structure (27 sections, 4 equations, 14 figures, 6 tables, 1 algorithm)

Figures (14)

  • Figure 1: (a) Illustration of the objective of dataset distillation. (b) The optimization in dataset distillation can be viewed as the process of generating informative patterns on the synthetic dataset. (c) We align the difficulty of the synthetic patterns with the size of the distilled dataset, to enable our method to perform well in both small and large IPC regimes. (d) Comparison of the performance of multiple dataset distillation methods on CIFAR-10 with different IPC. As IPC increases, the performance of previous methods becomes worse than random selection.
  • Figure 2: We train expert models on CIFAR-10 for 40 epochs. Then the distillation is performed under different IPC settings by matching either early trajectories $\{\theta^*_t|0\leq t \leq 20\}$, late trajectories $\{\theta^*_t|20 \leq t \leq 40\}$, or all trajectories $\{\theta^*_t|0\leq t \leq 40\}$. As IPC increases, matching late trajectories becomes beneficial while matching early trajectories tends to be harmful.
  • Figure 3: (a): (CIFAR-100, IPC=10) Synthetic datasets are initialized by randomly sampling data from the original dataset (random selection) or a subset of data that can be correctly classified (ours). Our strategy makes the optimization converge faster. (b): (CIFAR-10, IPC=50) Ablation on learning soft labels, where soft labels are initialized with expert models trained after different epochs. Learning labels relieves us from carefully selecting the labeling expert. (c): (CIFAR-10) The optimization with higher IPC converges in fewer iterations.
  • Figure 4: We perform the distillation on CIFAR-10 with IPC=50 by matching either early trajectories $\{\theta_t|0\leq t \leq 10\}$ or late trajectories $\{\theta_t|30\leq t \leq 40\}$. All synthetic images are optimized 1000 times. Matching earlier trajectories will blur the details of the target object and change the color more drastically.
  • Figure 5: Visualization of the synthetic datasets distilled with different IPC settings. As IPC increases, synthetic images move less far from their initialization.
  • ...and 9 more figures