Dataset Distillation as Pushforward Optimal Quantization

Hong Ye Tan, Emma Slade

TL;DR

This paper addresses the computational bottlenecks of dataset distillation by reframing disentangled methods as optimal quantization under a diffusion prior. It develops Dataset Distillation by Optimal Quantization (DDOQ), which encodes data into a latent space, performs weighted clustering, and decodes prototypes through a latent diffusion model to produce distilled data. The authors prove consistency and convergence rates for distilled datasets under score-based diffusion, and show empirically that DDOQ, especially with weighting, achieves competitive or superior performance to state-of-the-art disentangled and diffusion-guided methods on ImageNet-1K and its subsets. The approach reduces memory and compute while maintaining high fidelity of gradients during training, suggesting strong practical impact for efficient large-scale model training and potential applications across diffusion priors and latent-space pipelines.

Abstract

Dataset distillation aims to find a synthetic training set such that training on the synthetic data achieves similar performance to training on real data, with orders of magnitude lower computational requirements. Existing methods can be broadly categorized as either bi-level optimization problems that have neural network training heuristics as the lower-level problem, or disentangled methods that bypass the bi-level optimization by matching distributions of data. The latter approach has the major advantages of speed and scalability in terms of the size of both training and distilled datasets. We demonstrate that when equipped with an encoder-decoder structure, the empirically successful disentangled methods can be reformulated as an optimal quantization problem, where a finite set of points is found to approximate the underlying probability measure by minimizing the expected projection distance. In particular, we link existing disentangled dataset distillation methods to the classical optimal quantization and Wasserstein barycenter problems, demonstrating consistency of distilled datasets for diffusion-based generative priors. We propose Dataset Distillation by Optimal Quantization, based on clustering in a latent space. Compared to the previous SOTA method D$^4$M, we achieve better performance and inter-model generalization on the ImageNet-1K dataset with trivial additional computation, and SOTA performance in higher image-per-class settings. Using the distilled noise initializations in a stronger diffusion transformer model, we obtain SOTA distillation performance on ImageNet-1K and its subsets, outperforming diffusion guidance methods.
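To make the "expected projection distance" concrete, the following is a minimal sketch of the empirical quadratic distortion: the mean squared distance from each sample to its nearest quantization point. The function name `quadratic_distortion` and the Gaussian toy data are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def quadratic_distortion(samples: np.ndarray, points: np.ndarray) -> float:
    """Mean squared distance from each sample to its nearest quantization point."""
    # Pairwise squared distances, shape (n_samples, n_points).
    sq_dists = ((samples[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    # Project each sample onto its closest point, then average.
    return float(sq_dists.min(axis=1).mean())

# Toy data (assumption): 500 samples from a 2D standard Gaussian.
rng = np.random.default_rng(0)
samples = rng.normal(size=(500, 2))

# Two nested candidate quantizations: 4 points and 32 points.
coarse = samples[:4]
fine = samples[:32]

d_coarse = quadratic_distortion(samples, coarse)
d_fine = quadratic_distortion(samples, fine)
```

Because `fine` contains every point of `coarse`, its distortion can never be larger. Optimal quantization looks for the point set of a given size that minimizes this quantity.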

Paper Structure

This paper contains 28 sections, 15 theorems, 41 equations, 2 figures, and 7 tables.

Key Result

Proposition 1

Suppose we have a quantization $\mathbf{x} = \{x_1, \dots, x_K\}$. Assume that the probability measure $\mu$ is null on the boundaries of the Voronoi cells, i.e. $\mu(\partial C_i) = 0$. Then the measure $\nu$ that minimizes the Wasserstein-2 distance (eq:wassDistance) and satisfies $\mathop{\mathrm{supp}} \dots$
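The statement above is cut off in this extract. A standard fact from optimal quantization, which we assume is the setting here, is that for fixed points $x_1, \dots, x_K$ the Wasserstein-2-optimal measure supported on those points gives each $x_i$ the $\mu$-mass of its Voronoi cell $C_i$. The sketch below estimates these masses from samples; the helper name `voronoi_weights` and the toy data are hypothetical.

```python
import numpy as np

def voronoi_weights(samples: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Fraction of samples falling in the Voronoi cell of each point."""
    # Pairwise squared distances, shape (n_samples, n_points).
    sq_dists = ((samples[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    # Index of the nearest point for each sample (ties go to the lower index).
    nearest = sq_dists.argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(points))
    return counts / len(samples)

# Toy 1D example (assumption): three samples near 0, one sample at 1.
samples = np.array([[0.0], [0.1], [0.2], [1.0]])
points = np.array([[0.0], [1.0]])

weights = voronoi_weights(samples, points)  # array([0.75, 0.25])
```

The null-boundary assumption $\mu(\partial C_i) = 0$ means ties between cells have probability zero, so the tie-breaking rule in `argmin` does not affect the weights in the limit.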

Figures (2)

  • Figure 1: Sketch of the proposed method pipeline. Using an encoder/decoder model, we map our high-dimensional data to a low-dimensional space, which is then clustered using $k$-means. The clustered latent points and weights are then decoded to obtain the distilled data. This work argues that the weights are important when decoding; furthermore, "disentangled" distillation using an encoder-cluster-decoder framework is asymptotically consistent.
  • Figure 2: Example distilled images of the "jeep" class in ImageNet-1K along with their $k$-means weights below. There are few to no qualitative features that can be used to differentiate the low- and high-weighted images, mainly due to the high fidelity of the diffusion model. However, the weights are indicative of the distribution of the training data in the latent space of the diffusion model.
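The encode-cluster-decode pipeline described in Figure 1 can be sketched roughly as follows. This is an illustration under strong assumptions, not the authors' implementation: the encoder and decoder are hypothetical linear maps standing in for the latent diffusion model, the clustering is a plain Lloyd iteration, and all names (`lloyd_kmeans`, `encode`, `decode`) are invented for this example.

```python
import numpy as np

def lloyd_kmeans(latents: np.ndarray, k: int, n_iters: int = 20, seed: int = 0):
    """Plain Lloyd k-means; returns centroids and their cluster weights."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct data points.
    init = rng.choice(len(latents), size=k, replace=False)
    centroids = latents[init].copy()
    for _ in range(n_iters):
        sq = ((latents[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = sq.argmin(axis=1)
        for j in range(k):
            members = latents[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    # Final assignment gives the fraction of data in each cluster.
    sq = ((latents[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    labels = sq.argmin(axis=1)
    weights = np.bincount(labels, minlength=k) / len(latents)
    return centroids, weights

# Hypothetical stand-ins for the encoder/decoder: an orthonormal linear map
# from 16 dimensions to 4 and its transpose. The paper uses a latent
# diffusion model here instead.
rng = np.random.default_rng(0)
data = rng.normal(size=(300, 16))
basis, _ = np.linalg.qr(rng.normal(size=(16, 4)))  # shape (16, 4)

def encode(x: np.ndarray) -> np.ndarray:
    return x @ basis

def decode(z: np.ndarray) -> np.ndarray:
    return z @ basis.T

latents = encode(data)                        # 1. map data to latent space
centroids, weights = lloyd_kmeans(latents, 5)  # 2. cluster, keep the weights
distilled = decode(centroids)                 # 3. decode the prototypes
```

Here `distilled` holds five synthetic points in the original 16-dimensional space, and `weights` records how much of the data each one represents. A downstream training loop could use these weights to scale each distilled sample's contribution to the loss.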

Theorems & Definitions (26)

  • Definition 1: Quadratic distortion
  • Proposition 1
  • Proof
  • Remark 1
  • Proposition 2
  • Proof
  • Proposition 3: bally2003quantization
  • Remark 2
  • Remark 3
  • Theorem 1
  • ...and 16 more