Learnability-Guided Diffusion for Dataset Distillation

Jeffrey A. Chan-Santiago, Mubarak Shah

Abstract

Training machine learning models on massive datasets is expensive and time-consuming. Dataset distillation addresses this by creating a small synthetic dataset that achieves the same performance as the full dataset. Recent methods use diffusion models to generate distilled data, either by promoting diversity or matching training gradients. However, existing approaches produce redundant training signals, where samples convey overlapping information. Empirically, disjoint subsets of distilled datasets capture 80-90% overlapping signals. This redundancy stems from optimizing visual diversity or average training dynamics without accounting for similarity across samples, leading to datasets where multiple samples share similar information rather than complementary knowledge. We propose learnability-driven dataset distillation, which constructs synthetic datasets incrementally through successive stages. Starting from a small set, we train a model and generate new samples guided by learnability scores that identify what the current model can learn from, creating an adaptive curriculum. We introduce Learnability-Guided Diffusion (LGD), which balances training utility for the current model with validity under a reference model to generate curriculum-aligned samples. Our approach reduces redundancy by 39.1%, promotes specialization across training stages, and achieves state-of-the-art results on ImageNet-1K (60.1%), ImageNette (87.2%), and ImageWoof (72.9%). Our code is available on our project page https://jachansantiago.github.io/learnability-guided-distillation/.
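The abstract outlines an incremental loop: train a model on the current distilled set, generate candidates under learnability guidance that trades off utility for the current model against validity under a fixed reference model, and keep the most learnable candidates. The snippet below is a minimal sketch of that loop under these assumptions; train_model, guided_generate, and learnability_score are hypothetical placeholders, not the authors' released implementation.

```python
# Minimal sketch of the incremental, learnability-guided distillation loop
# described in the abstract. train_model, guided_generate, and
# learnability_score are hypothetical placeholders for a diffusion-based
# generator and a standard training pipeline, not the authors' code.

def distill_incrementally(initial_set, num_stages, samples_per_stage, reference_model):
    """Grow a distilled dataset stage by stage, guided by learnability."""
    distilled = list(initial_set)
    for _ in range(num_stages):
        # 1. Train a model on the current cumulative distilled set.
        current_model = train_model(distilled)

        # 2. Generate candidates with diffusion guidance that balances utility
        #    for the current model against validity under the fixed reference.
        candidates = guided_generate(current_model, reference_model,
                                     n=4 * samples_per_stage)

        # 3. Rank candidates by learnability and keep the top ones, i.e.
        #    samples the current model has not yet captured.
        candidates.sort(key=lambda x: learnability_score(x, current_model,
                                                         reference_model),
                        reverse=True)
        distilled.extend(candidates[:samples_per_stage])
    return distilled
```

Each stage then acts as one step of the adaptive curriculum described above, with the newly added increment chosen to complement what the current model has already learned.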


Paper Structure

This paper contains 39 sections, 9 equations, 14 figures, 8 tables, and 2 algorithms.

Figures (14)

  • Figure 1: Learnability-Guided Dataset Distillation. We partition the distilled dataset $\mathcal{D}_S$ into increments $\{\mathcal{I}_0, \mathcal{I}_1, \ldots, \mathcal{I}_k\}$ (Top). Bottom Left (DiT): Standard distillation generates increments independently, producing redundant samples---a model trained on $\mathcal{I}_0$ achieves 98.0% accuracy on $\mathcal{I}_1$, indicating no new information. Bottom Right (LGD): We condition the next increment on the model parameters $\theta_{\mathcal{I}_0}$ to guide synthesis toward samples that complement $\mathcal{I}_0$. The resulting increment $\mathcal{I}_1$ achieves only 17.0% accuracy when evaluated by the prior model, indicating it introduces substantial new learning signal.
  • Figure 2: Cross-validation across distilled data increments ($\mathcal{I}_1\!-\!\mathcal{I}_5$) for IPC 50 on ImageNette. Each heatmap shows accuracy when training on one increment (rows) and evaluating on another (columns). DiT [peebles2023scalable] and IGD [chen2025igd] exhibit high cross-increment accuracy due to overlapping information, while LGD yields lower off-diagonal scores, indicating more complementary and diverse increments (a small sketch of this cross-increment evaluation follows the figure list).
  • Figure 3: Overview of our learnability-guided iterative generation framework. (Top) Incremental distillation loop: we iteratively train model $\theta_t$ on cumulative dataset $\mathcal{D}_t$, generate samples using our learnability guidance, select high-quality samples via learnability ranking, and augment the dataset. (Bottom) Effect on sample space: The current model $\theta_t$ (green) expands over iterations, while the reference model $\theta^*$ (purple, fixed) defines the learnable region. Generated samples (red $\times$) land in the learnable gap between boundaries, automatically synthesizing samples that complement the current model's learned distribution.
  • Figure 4: Incremental training dynamics of DiT and our method. (a-b) show the training loss across successive data increments ($D_1 \rightarrow D_5$), where each increment adds new samples followed by a 300-epoch learning-rate decay (light beige). Our method yields stronger loss spikes ($\Delta$) after each increment, suggesting the added data is harder and complementary. (c) compares normalized validation accuracy per increment between DiT and our method, highlighting consistent accuracy gains and faster convergence for ours.
  • Figure 5: Learning-dynamics visualization of original and distilled samples. Each point shows a sample's mean and standard deviation of ground-truth class probability across training (50 epochs); a sketch of this per-sample statistic follows the figure list. Top-left points are easy (high $\mu$, low $\sigma^2$), bottom-left are hard, and mid-right indicate informative samples. Our method yields distilled samples that form a more informative (16.2%) and harder (2.6%) dataset, roughly 3$\times$ and 2$\times$ over IGD, respectively, aligning more closely with the original training dynamics distribution, as shown by the lowest Jensen--Shannon divergence (JS $\downarrow$).
  • ...and 9 more figures
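
The cross-increment redundancy measurement behind Figure 2 can be reproduced in outline as follows; train_classifier and evaluate_accuracy are hypothetical stand-ins for a standard training and evaluation pipeline, not functions named in the paper.

```python
# Sketch of the cross-increment redundancy check from Figure 2: train on one
# increment, evaluate on another, and inspect the off-diagonal accuracies.
# train_classifier and evaluate_accuracy are hypothetical helpers.
import numpy as np

def cross_increment_accuracy(increments):
    """Entry (i, j) is the accuracy of a model trained on increment i
    when evaluated on increment j."""
    k = len(increments)
    acc = np.zeros((k, k))
    for i, train_set in enumerate(increments):
        model = train_classifier(train_set)
        for j, eval_set in enumerate(increments):
            acc[i, j] = evaluate_accuracy(model, eval_set)
    return acc
```

High off-diagonal values (as reported for DiT and IGD) indicate overlapping information across increments; lower values (as reported for LGD) indicate complementary increments.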
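Similarly, the per-sample statistics plotted in Figure 5 are the mean and standard deviation of each sample's ground-truth class probability over training epochs. A minimal sketch, assuming a hypothetical probability matrix of shape (epochs, samples) collected during training:

```python
# Per-sample learning-dynamics statistics as in Figure 5.
# probs_per_epoch[e, i] is the model's probability for sample i's true class
# at epoch e (a hypothetical array collected over, e.g., 50 epochs).
import numpy as np

def learning_dynamics_stats(probs_per_epoch):
    """Return per-sample (mean, std) of ground-truth class probability."""
    mu = probs_per_epoch.mean(axis=0)    # high mu, low sigma -> easy sample
    sigma = probs_per_epoch.std(axis=0)  # low mu -> hard; higher sigma at moderate mu -> informative
    return mu, sigma
```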