SelMatch: Effectively Scaling Up Dataset Distillation via Selection-Based Initialization and Partial Updates by Trajectory Matching

Yongmin Lee, Hye Won Chung

TL;DR

SelMatch is introduced, a novel distillation method that scales effectively with IPC. It uses selection-based initialization and partial updates through trajectory matching to manage the synthetic dataset's desired difficulty level, tailored to the IPC scale.

Abstract

Dataset distillation aims to synthesize a small number of images per class (IPC) from a large dataset to approximate full dataset training with minimal performance loss. While effective in very small IPC ranges, many distillation methods become less effective, even underperforming random sample selection, as IPC increases. Our examination of state-of-the-art trajectory-matching based distillation methods across various IPC scales reveals that these methods struggle to incorporate the complex, rare features of harder samples into the synthetic dataset even with the increased IPC, resulting in a persistent coverage gap between easy and hard test samples. Motivated by such observations, we introduce SelMatch, a novel distillation method that effectively scales with IPC. SelMatch uses selection-based initialization and partial updates through trajectory matching to manage the synthetic dataset's desired difficulty level tailored to IPC scales. When tested on CIFAR-10/100 and TinyImageNet, SelMatch consistently outperforms leading selection-only and distillation-only methods across subset ratios from 5% to 30%.
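The abstract's two key components can be illustrated with a toy sketch: the synthetic set is split into a frozen portion and a distilled portion, and each matching step updates only the distilled portion. This is a minimal NumPy sketch under stated assumptions; the function names (`split_synthetic`, `partial_update`), the generic `grad_fn`, and the index-based partitioning are illustrative, not the authors' implementation.

```python
import numpy as np

def split_synthetic(d_syn, alpha):
    """Freeze a (1 - alpha) fraction (D_select) and mark an alpha
    fraction (D_distill) for updates.  How samples are assigned to
    each part is an assumption here (simple index split)."""
    n_distill = int(alpha * len(d_syn))
    d_distill = d_syn[:n_distill]   # samples updated by trajectory matching
    d_select = d_syn[n_distill:]    # frozen real samples
    return d_select, d_distill

def partial_update(d_select, d_distill, grad_fn, lr=0.1):
    """One distillation step: the matching loss is computed on the
    full synthetic set, but gradients are applied only to D_distill."""
    full = np.concatenate([d_distill, d_select])
    grad = grad_fn(full)  # stand-in for the trajectory-matching gradient
    d_distill = d_distill - lr * grad[:len(d_distill)]
    return d_select, d_distill
```

The point of the sketch is the asymmetry: the frozen samples still participate in the matching loss (so distilled samples adapt around them), but they are never altered themselves.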

Paper Structure

This paper contains 41 sections, 3 equations, 15 figures, 8 tables, and 1 algorithm.

Figures (15)

  • Figure 1: (a) (left) Overall coverage and (right) coverage of easy vs. hard groups with varying IPC. We observe that coverage by MTT saturates as IPC increases, especially for the hard group. SelMatch (our method) exhibits superior overall coverage, with marked improvements for the hard group. (b) Coverage by MTT decreases rapidly as the distillation proceeds, while that with SelMatch remains stable. (c) Test accuracy on easy vs. hard groups with varying IPC. With MTT, the test accuracy for the hard group eventually aligns with that achieved by random selection as IPC increases. All these findings indicate that traditional MTT overly focuses on synthesizing easy features, leading to saturation in both coverage and test accuracy even with higher IPCs. In contrast, our method, SelMatch, achieves effective scaling with IPC, enhancing coverage for both easy and hard samples and consequently achieving superior test accuracy.
  • Figure 2: Illustration of our select-and-match method, SelMatch. Our method comprises two key components: 1) Selection-based initialization: SelMatch employs our sliding-window algorithm to select a subset of a suitable difficulty level, initializing the synthetic dataset $\mathcal{D}_\textrm{syn}$ with this chosen subset; 2) Partial update: SelMatch freezes a (1-$\alpha$) fraction of samples ($\mathcal{D}_\textrm{select}$) and updates only an $\alpha$ fraction of samples ($\mathcal{D}_\textrm{distill}$) while minimizing the matching loss $\mathcal{L}( \mathcal{D}_\textrm{select}\cup \mathcal{D}_\textrm{distill}, \mathcal{D}_\textrm{real})$ to preserve unique features of selected real samples.
  • Figure 3: Results of the sliding-window experiment on CIFAR-10 with varying subset sizes (5% to 30%). Dashed horizontal lines indicate the accuracy of models trained on randomly selected subsets of the corresponding size. Solid lines indicate the accuracy of models trained on a window subset of samples ordered by their difficulty scores (from hardest to easiest by C-score), with varying window starting point $\beta$%.
  • Figure 4: (a) Ablation on the distillation portion $\alpha$ in the synthetic dataset for CIFAR-100 with varying IPC. The optimal $\alpha$ tends to decrease as IPC increases. (b) Ablation on the augmentation strategy on CIFAR-100 with IPC=50. The results show the effectiveness of our combined augmentation technique. (c) Ablation on batch normalization on CIFAR-100 with IPC=50. Employing batch normalization for both distillation and evaluation exhibits the best performance. (d) Ablation on the max start epoch $T^+$ on CIFAR-100 with IPC=50 and 100. The results indicate that utilizing later epochs enhances performance in the large IPC regime.
  • Figure 5: Analysis of SelMatch on CIFAR-100 with IPC=50. (a) T-SNE visualization of (left) MTT and (right) SelMatch. Small red, green, and blue points represent real samples (test set) of the first three classes of CIFAR-100. Large circles indicate samples in the synthetic dataset. For SelMatch, unaltered samples ($\mathcal{D}_\textrm{select}$) are denoted by 'X' markers in darker colors. We observe that samples in $\mathcal{D}_\textrm{select}$ are located closer to the decision boundary than those in $\mathcal{D}_\textrm{distill}$. (b) Evolution of the $\ell_2$ norm of network gradients on $\mathcal{D}_\textrm{select}$ and $\mathcal{D}_\textrm{distill}$. The gradient norm on $\mathcal{D}_\textrm{select}$ is larger than that on $\mathcal{D}_\textrm{distill}$. Note that the network is trained on the entire synthetic set $\mathcal{D}_\textrm{syn} = \mathcal{D}_\textrm{select} \cup \mathcal{D}_\textrm{distill}$.
  • ...and 10 more figures
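The sliding-window selection behind Figure 3 can be sketched in a few lines: samples are ordered from hardest to easiest by a difficulty score (e.g. C-score), and a contiguous window starting at $\beta$% of the ordering is selected. This is an illustrative sketch, not the authors' code; the function name `sliding_window_select` and the clamping behavior at the end of the list are assumptions.

```python
def sliding_window_select(ordered_indices, window_size, beta):
    """Select a contiguous window of samples.

    ordered_indices: sample indices sorted hardest -> easiest
                     by a difficulty score (e.g. C-score).
    window_size: number of samples to keep (IPC * num_classes).
    beta: window starting point as a fraction in [0, 1].
    """
    n = len(ordered_indices)
    start = int(beta * n)
    start = min(start, n - window_size)  # keep the window inside the list
    return ordered_indices[start:start + window_size]

# Example: 100 samples ordered by difficulty, select 10 starting at beta = 0.3,
# i.e. skip the hardest 30% and take the next 10 samples.
subset = sliding_window_select(list(range(100)), window_size=10, beta=0.3)
```

Sweeping $\beta$ as in Figure 3 then amounts to calling this selector for each starting point and training a model on each resulting window.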