Elucidating the Design Space of Dataset Condensation
Shitong Shao, Zikai Zhou, Huanran Chen, Zhiqiang Shen
TL;DR
This paper addresses the high computational cost and limited design space of existing dataset condensation methods by introducing Elucidate Dataset Condensation (EDC), a comprehensive framework that optimizes data synthesis, soft label generation, and post-evaluation. Its key innovations include real-image initialization, soft category-aware matching that blends mean/variance and Gaussian Mixture Model statistics, flatness regularization via EMA-based SAM on logits, a smoothing learning-rate schedule, small batch sizes, and EMA-based evaluation. Empirically, EDC achieves state-of-the-art results across CIFAR, Tiny-ImageNet, and ImageNet variants, notably reaching $48.6\%$ top-1 accuracy on ImageNet-1k with $10$ images per class (IPC) using ResNet-18 and outperforming prior methods by substantial margins. The work also demonstrates cross-architecture generalization and scalability to larger datasets, and its ablations confirm the contribution of each design choice, underscoring its practical value for efficient data-centric learning.
Abstract
Dataset condensation, a concept within data-centric learning, efficiently transfers critical attributes from an original dataset to a synthetic version, maintaining both diversity and realism. This approach significantly improves model training efficiency and is adaptable across multiple application areas. Previous dataset condensation methods have faced two challenges: some incur high computational costs that limit scalability to larger datasets (e.g., MTT, DREAM, and TESLA), while others are restricted to suboptimal design spaces, which could hinder potential improvements, especially on smaller datasets (e.g., SRe2L, G-VBSM, and RDED). To address these limitations, we propose a comprehensive design framework that includes specific, effective strategies such as soft category-aware matching and an adjusted learning rate schedule. These strategies are grounded in empirical evidence and theoretical backing. Our resulting approach, Elucidate Dataset Condensation (EDC), establishes a benchmark for both small- and large-scale dataset condensation. In our testing, EDC achieves state-of-the-art accuracy, reaching 48.6% on ImageNet-1k with a ResNet-18 model at an IPC (images per class) of 10, which corresponds to a compression ratio of 0.78%. This performance exceeds that of SRe2L, G-VBSM, and RDED by absolute margins of 27.3%, 17.2%, and 6.6%, respectively.
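The 0.78% compression ratio quoted above can be reproduced with a short calculation. The sketch below is illustrative rather than taken from the paper: the function name `compression_ratio` is hypothetical, and it assumes the standard ImageNet-1k (ILSVRC-2012) training split of 1,281,167 images over 1,000 classes.

```python
def compression_ratio(ipc: int, num_classes: int, train_size: int) -> float:
    """Fraction of the original training set kept after condensation.

    ipc is the number of synthetic images per class (IPC).
    """
    return ipc * num_classes / train_size


# Assumption: standard ImageNet-1k (ILSVRC-2012) training split.
IMAGENET_1K_TRAIN_SIZE = 1_281_167
IMAGENET_1K_CLASSES = 1_000

# IPC 10 over 1,000 classes gives 10,000 synthetic images in total.
ratio = compression_ratio(
    ipc=10,
    num_classes=IMAGENET_1K_CLASSES,
    train_size=IMAGENET_1K_TRAIN_SIZE,
)
print(f"{ratio:.2%}")  # prints 0.78%
```

Under these assumptions, the condensed set holds 10,000 images, which is roughly 1 image for every 128 in the original training set.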
