Beyond Dataset Distillation: Lossless Dataset Concentration via Diffusion-Assisted Distribution Alignment

Tongfei Liu, Yufan Liu, Bing Li, Weiming Hu

Abstract

The high cost of and limited access to large datasets hinder the development of large-scale visual recognition systems. Dataset Distillation addresses these problems by synthesizing compact surrogate datasets for efficient training, storage, transfer, and privacy preservation. Existing state-of-the-art diffusion-based dataset distillation methods face three issues: a lack of theoretical justification, poor efficiency when scaling to high data volumes, and failure in data-free scenarios. To address these issues, we establish a theoretical framework that justifies the use of diffusion models by proving the equivalence between dataset distillation and distribution matching, and that reveals an inherent efficiency limit of the dataset distillation paradigm. We then propose a Dataset Concentration (DsCo) framework that uses a diffusion-based Noise-Optimization (NOpt) method to synthesize a small yet representative set of samples, and optionally augments the synthetic data via "Doping", which mixes selected samples from the original dataset into the synthetic set to overcome the efficiency limit of dataset distillation. DsCo is applicable in both data-accessible and data-free scenarios, achieves SOTA performance at low data volumes, and extends well to high data volumes, where it reduces the dataset size by nearly half with no performance degradation.
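To make the "Doping" idea concrete, here is a minimal sketch in Python/NumPy, assuming a per-class budget of `ipc` images and a hypothetical `dope_ratio` hyperparameter; the paper's actual selection rule for the real samples is not reproduced, so random selection stands in for it.

```python
import numpy as np

def dope(synthetic, real_pool, ipc, dope_ratio=0.5, rng=None):
    """Fill a per-class budget with a mix of synthetic and real samples.

    Hedged sketch: `dope_ratio` and the random selection below are our
    assumptions; the paper selects real samples by its own criterion.
    """
    rng = np.random.default_rng(rng)
    n_real = int(round(ipc * dope_ratio))   # slots doped with real samples
    n_syn = ipc - n_real                    # slots kept for synthetic samples
    real_idx = rng.choice(len(real_pool), size=n_real, replace=False)
    return np.concatenate([synthetic[:n_syn], real_pool[real_idx]], axis=0)

# Stand-in arrays for one class (shapes are illustrative only).
syn = np.random.rand(50, 3, 32, 32)    # NOpt synthetic samples
real = np.random.rand(500, 3, 32, 32)  # accessible real pool
coreset = dope(syn, real, ipc=50)      # 50 images per class, half real
```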

Paper Structure

This paper contains 70 sections, 2 theorems, 31 equations, 8 figures, and 13 tables.

Key Result

Proposition 1

For a model $\Phi$ that makes predictions on $\mathcal{T}$ by memorizing samples in $\mathcal{S}$, if Assumption 1 holds, maximizing the total expected chance of recognizing the samples in $\mathcal{T}$ is equivalent to minimizing the continuous distribution discrepancy between the distributions of $\mathcal{S}$ and $\mathcal{T}$.
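The proposition leaves the discrepancy unspecified in this excerpt. Since Figure 1 plots the target and surrogate features in an RKHS, one natural instantiation, which we assume for illustration rather than quote from the paper, is the squared maximum mean discrepancy between the empirical distributions of $\mathcal{S}$ and $\mathcal{T}$:

```latex
% Assumed instantiation: squared MMD in an RKHS \mathcal{H} with
% feature map \phi and kernel k(x, y) = \langle \phi(x), \phi(y) \rangle.
\mathrm{MMD}^2(\mathcal{S}, \mathcal{T}) =
  \left\| \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \phi(s)
        - \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \phi(t)
  \right\|_{\mathcal{H}}^{2}
```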

Figures (8)

  • Figure 1: The graphical illustration of the target features and surrogate features plotted in the RKHS.
  • Figure 2: The Dataset Concentration framework. Enclosed in the gray region is the iterative denoising process of NOpt; the dashed red arrows indicate the gradient flow during each noise-optimization (a minimal sketch of this loop follows after this list).
  • Figure 3: ResNet-18 Accuracy vs IPC for ImageNet-1k. The sub-figure enclosed in the major figure depicts the 10IPC and 50IPC performances, which are commonly reported in the dataset distillation literature.
  • Figure 4: Synthesis costs, measured in running time on NVIDIA GeForce RTX 2080 Ti GPUs, plotted against IPC for contemporary open-source diffusion-based dataset distillation or concentration methods. The dashed lines are estimated costs obtained through linear extrapolation.
  • Figure 5: Visualization of the real samples (left), DiT synthetic samples (middle left), DsCo synthetic samples (middle right), and the DFDsCo synthetic samples (right) of the Church class of the ImageNette dataset.
  • ...and 3 more figures
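Figure 2's gray region and red gradient arrows suggest the following shape for NOpt: unroll a few denoising steps from a latent noise tensor, score the result with a distribution-matching loss, and backpropagate to the noise itself. The PyTorch sketch below uses toy stand-ins for the diffusion denoiser, the feature extractor, and the loss; every module and hyperparameter here is our placeholder, not the paper's.

```python
import torch

# Toy stand-ins: in the paper these would be a pretrained diffusion
# model and a recognition backbone; both are frozen placeholders here.
denoise = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Tanh())
feat = torch.nn.Linear(64, 16)
for p in list(denoise.parameters()) + list(feat.parameters()):
    p.requires_grad_(False)

target_feats = feat(torch.randn(256, 64))        # features of "real" data

noise = torch.randn(10, 64, requires_grad=True)  # the variable NOpt optimizes
opt = torch.optim.Adam([noise], lr=1e-2)

for step in range(100):
    x = noise
    for _ in range(4):     # iterative denoising (gray region in Fig. 2)
        x = denoise(x)
    # Placeholder distribution-matching loss: align mean features of the
    # denoised surrogates with those of the target data.
    loss = (feat(x).mean(0) - target_feats.mean(0)).pow(2).sum()
    opt.zero_grad()
    loss.backward()        # gradient flows back to the noise (red arrows)
    opt.step()
```

This sketch covers only the data-accessible case; in the data-free scenario the abstract mentions, the target statistics would have to come from somewhere other than real features, which the sketch does not attempt to model.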

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2