Elucidating the Design Space of Dataset Condensation

Shitong Shao, Zikai Zhou, Huanran Chen, Zhiqiang Shen

TL;DR

This paper addresses the high computational burden and limited design space of existing dataset condensation methods by introducing Elucidate Dataset Condensation (EDC), a comprehensive framework that optimizes data synthesis, soft label generation, and post-evaluation. Its key innovations include real-image initialization, soft category-aware matching that blends mean/variance and Gaussian Mixture Model statistics, flatness regularization via EMA-based SAM on logits, a smoothing learning-rate schedule, small batch sizes, and EMA-based evaluation. Empirically, EDC achieves state-of-the-art results across CIFAR, Tiny-ImageNet, and ImageNet variants, notably reaching $48.6\%$ top-1 on ImageNet-1k with IPC $10$ using ResNet-18 and outperforming prior methods by substantial margins. The work also demonstrates cross-architecture generalization, ablations confirming the contributions of each design choice, and scalability to larger datasets, underscoring its practical impact for efficient data-centric learning.
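As a rough illustration of the exponential-moving-average (EMA) idea behind the EMA-based evaluation mentioned above, the sketch below keeps a smoothed copy of a model's parameters and updates it after each training step. It is a generic EMA in plain Python, not the authors' implementation: the function name `ema_update`, the dictionary-of-lists parameter layout, and the decay value are all assumptions made for this example.

```python
def ema_update(ema_params, params, decay):
    """Generic EMA step: ema <- decay * ema + (1 - decay) * params.

    Both arguments map a parameter name to a list of floats.
    The layout and the decay value are illustrative assumptions.
    """
    return {
        name: [decay * e + (1.0 - decay) * p
               for e, p in zip(ema_params[name], values)]
        for name, values in params.items()
    }


# Toy run: the "training step" simply adds 1.0 to every weight.
params = {"w": [0.0, 0.0, 0.0]}
ema = {name: list(values) for name, values in params.items()}

for _ in range(3):
    params = {"w": [p + 1.0 for p in params["w"]]}  # stand-in for an optimizer step
    ema = ema_update(ema, params, decay=0.5)

print(params["w"])  # raw weights after three steps
print(ema["w"])     # smoothed weights lag behind the raw ones
```

In an evaluation setting, the smoothed copy (here `ema`) would be the one scored on held-out data, since averaging damps step-to-step noise in the raw weights.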

Abstract

Dataset condensation, a concept within data-centric learning, efficiently transfers critical attributes from an original dataset to a synthetic version while maintaining both diversity and realism. This approach significantly improves model training efficiency and is adaptable across multiple application areas. Previous dataset condensation methods face two kinds of challenges: some incur high computational costs that limit scalability to larger datasets (e.g., MTT, DREAM, and TESLA), while others are restricted to suboptimal design spaces, which can hinder potential improvements, especially on smaller datasets (e.g., SRe2L, G-VBSM, and RDED). To address these limitations, we propose a comprehensive design framework that includes specific, effective strategies such as soft category-aware matching and an adjusted learning rate schedule. These strategies are grounded in empirical evidence and theoretical backing. Our resulting approach, Elucidate Dataset Condensation (EDC), establishes a benchmark for both small and large-scale dataset condensation. In our testing, EDC achieves state-of-the-art accuracy, reaching 48.6% on ImageNet-1k with a ResNet-18 model at an IPC of 10, which corresponds to a compression ratio of 0.78%. This performance exceeds that of SRe2L, G-VBSM, and RDED by margins of 27.3%, 17.2%, and 6.6%, respectively.
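The compression ratio quoted in the abstract can be checked with a short calculation. The sketch below assumes the standard ILSVRC-2012 training split of ImageNet-1k (1,281,167 images, 1,000 classes), which the abstract does not state explicitly, and it reads the reported margins as absolute percentage points in order to back out the implied baseline accuracies. The function and variable names are illustrative only.

```python
# Assumption: ImageNet-1k here means the ILSVRC-2012 training split.
IMAGENET_1K_TRAIN_IMAGES = 1_281_167
IMAGENET_1K_CLASSES = 1_000


def compression_ratio(ipc, num_classes, num_train_images):
    """Fraction of the original training set kept at a given images-per-class (IPC)."""
    return ipc * num_classes / num_train_images


ratio = compression_ratio(10, IMAGENET_1K_CLASSES, IMAGENET_1K_TRAIN_IMAGES)
print(f"{ratio:.2%}")  # 0.78%

# Baseline accuracies implied by the abstract, assuming the margins are
# absolute percentage points relative to EDC's 48.6%.
edc_accuracy = 48.6
margins = {"SRe2L": 27.3, "G-VBSM": 17.2, "RDED": 6.6}
implied = {name: round(edc_accuracy - gap, 1) for name, gap in margins.items()}
print(implied)
```

At IPC 10 the condensed set holds 10,000 images, which is where the 0.78% figure comes from.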

Paper Structure (45 sections, 10 theorems, 36 equations, 11 figures, 25 tables)

This paper contains 45 sections, 10 theorems, 36 equations, 11 figures, 25 tables.

Key Result

Theorem 3.1

(proof in Appendix apd:random_vs_real) Considering samples $\mathcal{X}^\mathcal{S}_\textbf{real}$, $\mathcal{X}^\mathcal{S}_\textbf{free}$, and $\mathcal{X}^\mathcal{S}_\textbf{random}$ from the original data, training-free condensed (e.g., RDED), and Gaussian distributions, respectively, let us as

Figures (11)

  • Figure 1: Illustration of Elucidating Dataset Condensation (EDC). Left: An overview of our improved design choices for dataset condensation on ImageNet-1k. Right: The evaluation performance and the time required for data synthesis under different configurations on ResNet-18 with IPC 10. Our complete EDC corresponds to $\mathbb{CONFIG}$ G.
  • Figure 2: (a): Illustration of soft category-aware matching using a Gaussian distribution in $\mathbb{R}^2$. (b): The effect of employing smoothing LR schedule on loss landscape sharpness reduction. (c) top: The role of flatness regularization in reducing the Frobenius norm of the Hessian matrix driven by data synthesis iteration. (c) bottom: Cosine similarity comparison between local gradients (obtained from original and distilled datasets via random batch selection) and the global gradient (obtained from gradient accumulation).
  • Figure 3: Comparison between real image initialization and random initialization.
  • Figure 4: Application on ImageNet-1k. We evaluate the effectiveness of data-free network slimming and continual learning using VGG11-BN and ResNet-18, respectively.
  • Figure 5: Visualization of the synthetic images of prior training-dependent dataset condensation methods.
  • ...and 6 more figures

Theorems & Definitions (19)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • proof
  • Definition B.1
  • Lemma B.2
  • proof
  • Lemma B.3
  • proof
  • Theorem B.5
  • ...and 9 more