Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation

Yanda Chen, Gongwei Chen, Miao Zhang, Weili Guan, Liqiang Nie

Abstract

Dataset distillation (DD) excels at synthesizing a small number of images per class (IPC) but struggles to remain effective in high-IPC settings. Recent work on dataset distillation shows that combining distilled and real data can mitigate this decay in effectiveness. However, our analysis of the combination paradigm reveals that the current one-shot, independent selection mechanism induces an incompatibility between distilled and real images. To address this issue, we introduce a novel curriculum coarse-to-fine selection (CCFS) method for efficient high-IPC dataset distillation. CCFS employs a curriculum framework for real data selection, in which a coarse-to-fine strategy selects appropriate real data based on the current synthetic dataset in each curriculum phase. Extensive experiments validate CCFS, which surpasses the state of the art by +6.6% on CIFAR-10, +5.8% on CIFAR-100, and +3.4% on Tiny-ImageNet under high-IPC settings. Notably, CCFS achieves 60.2% test accuracy on ResNet-18 with a 20% compression ratio of Tiny-ImageNet, closely matching full-dataset training with only 0.3% degradation. Code: https://github.com/CYDaaa30/CCFS.

Paper Structure

This paper contains 32 sections, 6 equations, 11 figures, 13 tables, 1 algorithm.

Figures (11)

  • Figure 1: Comparison of combination-based dataset distillation. Top: General paradigm. Bottom: (a) SelMatch conducts an independent and one-shot selection of $\mathcal{D}_\textrm{real}$. (b) Our method applies curriculum selection, making $\mathcal{D}_\textrm{real}$ dependent on $\mathcal{D}_\textrm{distill}$.
  • Figure 2: Results of the analysis experiments on CIFAR-100. (a) Top-1 accuracy of the three settings at IPC=25, 50, 100, 150. At each IPC, setting (b), which modifies only the selection strategy of $\mathcal{D}_\textrm{real}$, outperforms setting (a), the original SelMatch setup. Setting (c) reverses the process of setting (b) by first distilling $\mathcal{D}_\textrm{distill}$ and then conducting a two-shot selection to obtain $\mathcal{D}_\textrm{real}$, resulting in the best performance among the three settings. (b) A detailed comparison between settings (a) and (c) at various window starting points $\beta$. For every $\beta$, setting (c) outperforms setting (a), and its performance is more stable across different $\beta$.
  • Figure 3: Architecture of CCFS, our curriculum coarse-to-fine selection method for high-IPC dataset distillation. CCFS combines distilled and real data to construct the final synthetic dataset. We apply a curriculum framework and, in each curriculum phase, select the optimal real data for the current synthetic dataset. (a) Curriculum selection framework: CCFS begins the curriculum with the already distilled data as the initial synthetic dataset. It then continuously incorporates real data into the current synthetic dataset through the coarse-to-fine selection within each curriculum phase. (b) Coarse-to-fine selection strategy: In the coarse stage, CCFS trains a filter model on the current synthetic dataset and evaluates it on the original dataset, excluding already selected data, to filter out all correctly classified samples. In the fine stage, CCFS selects the simplest misclassified samples and incorporates them into the current synthetic dataset for the next curriculum phase.
  • Figure 4: Further analysis of the curriculum framework. (a) Performance of the filter model trained on the synthetic dataset in each curriculum phase with IPC=50: The filter’s classification accuracy steadily improves on both the original training set and the validation set. (b) The difficulty distribution of real samples selected in each curriculum phase: As the curriculum progresses, the average difficulty and the upper and lower difficulty bounds of the selected samples all increase significantly. Moreover, higher-IPC settings tend to include more difficult samples than lower-IPC settings within the same curriculum phase. CCFS effectively guides the synthetic dataset to incorporate more challenging samples. (c) Visualization of the samples selected in each curriculum phase. We present images of median difficulty across several categories in Tiny-ImageNet: Albatross, School Bus and Banana. The visualization illustrates the gradual increase in difficulty (diverse poses, complex backgrounds, other distractions...) facilitated by CCFS.
  • Figure 5: Impact of the distillation portion $\alpha$ on CIFAR-10/100 and Tiny-ImageNet. We recommend a small distillation portion $\alpha$ in high-IPC settings.
  • ...and 6 more figures
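
The curriculum loop described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the filter model is replaced by a nearest-centroid classifier, the per-sample `difficulty` scores are assumed to be precomputed by some external scorer, the per-class quota `per_class_per_phase` is an assumption, and all function and parameter names are hypothetical.

```python
def train_centroid_filter(xs, ys, num_classes):
    """Stand-in filter model (assumption): one mean vector per class."""
    centroids = []
    for c in range(num_classes):
        members = [x for x, y in zip(xs, ys) if y == c]
        centroids.append([sum(col) / len(members) for col in zip(*members)])
    return centroids


def predict(centroids, x):
    """Return the class whose centroid is nearest to x (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    return dists.index(min(dists))


def ccfs_select(x_distill, y_distill, x_real, y_real, difficulty,
                num_classes, num_phases, per_class_per_phase):
    """Return indices of real samples chosen over all curriculum phases.

    x_distill / y_distill: already distilled data (must cover every class).
    difficulty: hypothetical precomputed score per real sample (lower = simpler).
    """
    selected = []  # indices into the real dataset, in selection order
    for _ in range(num_phases):
        # Current synthetic dataset = distilled data + real data chosen so far.
        xs = list(x_distill) + [x_real[i] for i in selected]
        ys = list(y_distill) + [y_real[i] for i in selected]
        centroids = train_centroid_filter(xs, ys, num_classes)

        # Coarse stage: keep only not-yet-selected real samples that the
        # filter misclassifies; correctly classified samples are filtered out.
        chosen = set(selected)
        wrong = [i for i in range(len(x_real))
                 if i not in chosen and predict(centroids, x_real[i]) != y_real[i]]

        # Fine stage: per class, add the simplest misclassified samples.
        # If a class has fewer candidates than the quota, take what exists.
        for c in range(num_classes):
            cand = sorted((i for i in wrong if y_real[i] == c),
                          key=lambda i: difficulty[i])
            selected.extend(cand[:per_class_per_phase])
    return selected


# Toy 1-D usage: two classes, one distilled point per class.
x_distill, y_distill = [[0.0], [10.0]], [0, 1]
x_real = [[1.0], [6.0], [7.0], [9.0], [4.0]]
y_real = [0, 0, 0, 1, 1]
difficulty = [0.1, 0.5, 0.9, 0.1, 0.7]
picked = ccfs_select(x_distill, y_distill, x_real, y_real, difficulty,
                     num_classes=2, num_phases=2, per_class_per_phase=1)
```

Because the filter is retrained on the enlarged synthetic dataset at the start of every phase, the set of misclassified candidates changes from phase to phase, which is what makes the real data selection depend on the current synthetic dataset rather than being a one-shot choice.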