Data Pruning in Generative Diffusion Models

Rania Briq, Jiangtao Wang, Stefan Kesselheim

TL;DR

The paper investigates data pruning for generative diffusion models, addressing whether pruning can improve training efficiency while maintaining output quality. It evaluates a suite of pruning strategies, including clustering-based methods with CLIP and DINO embeddings, within a flow-matching diffusion framework that uses a VQ-VAE latent space. Results show diffusion models are surprisingly tolerant to data reduction, with ImageNet benefiting most from clustering-based pruning and CelebA-HQ showing little gain from clustering; aggressive pruning can still degrade novelty. The work demonstrates practical benefits for data-efficient training and, importantly, presents unsupervised methods to balance skewed data distributions, enhancing fairness in generative modeling.

Abstract

Data pruning is the problem of identifying a core subset that is most beneficial to training and discarding the remainder. While pruning strategies are well studied for discriminative models like those used in classification, little research has gone into their application to generative models. Generative models aim to estimate the underlying distribution of the data, so presumably they should benefit from larger datasets. In this work we aim to shed light on the accuracy of this statement, specifically to answer the question of whether data pruning for generative diffusion models could have a positive impact. Contrary to intuition, we show that eliminating redundant or noisy data in large datasets is beneficial, particularly when done strategically. We experiment with several pruning methods, including recent state-of-the-art methods, and evaluate on the CelebA-HQ and ImageNet datasets. We demonstrate that a simple clustering method outperforms other sophisticated and computationally demanding methods. We further show how clustering can be leveraged to balance skewed datasets in an unsupervised manner to allow fair sampling for underrepresented populations in the data distribution, which is a crucial problem in generative models.
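
To make the clustering idea in the abstract concrete, the following is a minimal sketch of cluster-balanced pruning. It is not the authors' implementation, and every name in it is an assumption made for illustration. The functions `farthest_first_init`, `kmeans`, and `cluster_balanced_prune` are hypothetical, the 2-D synthetic vectors stand in for image embeddings such as CLIP or DINO features, and the round-robin selection rule is just one simple way to draw roughly equally from each cluster.

```python
import numpy as np


def farthest_first_init(x, k):
    """Deterministic init: start at x[0], then repeatedly add the farthest point."""
    chosen = [0]
    dist = ((x - x[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        idx = int(dist.argmax())
        chosen.append(idx)
        dist = np.minimum(dist, ((x - x[idx]) ** 2).sum(axis=1))
    return x[chosen].astype(float)  # fancy indexing returns a copy


def kmeans(x, k, iters=20):
    """Plain Lloyd's algorithm; returns one cluster label per row of x."""
    centers = farthest_first_init(x, k)
    for _ in range(iters):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)


def cluster_balanced_prune(embeddings, prune_ratio, k, seed=0):
    """Keep about (1 - prune_ratio) of the samples, drawn evenly across clusters.

    Assumption: samples are taken round-robin, one per cluster per pass, so
    small clusters are kept whole before large clusters contribute more.
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    n = len(embeddings)
    budget = int(round(n * (1.0 - prune_ratio)))
    labels = kmeans(embeddings, k)
    rng = np.random.default_rng(seed)
    clusters = [rng.permutation(np.flatnonzero(labels == j)) for j in range(k)]
    keep, depth = [], 0
    while len(keep) < budget:
        for members in clusters:
            if depth < len(members) and len(keep) < budget:
                keep.append(members[depth])
        depth += 1
    return np.sort(np.array(keep, dtype=int)), labels


# Usage on a skewed toy dataset: 900 "majority" and 100 "minority" points.
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(0.0, 1.0, size=(900, 2)),   # majority population
    rng.normal(20.0, 1.0, size=(100, 2)),  # minority population
])
keep, labels = cluster_balanced_prune(data, prune_ratio=0.8, k=2)
print("kept:", len(keep), "minority share:", float((keep >= 900).mean()))
```

In this toy setup the minority group makes up 10% of the data, and drawing evenly from the clusters raises its share in the retained subset. This is the kind of unsupervised rebalancing the abstract describes, shown here only under the assumptions stated above.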

Paper Structure

This paper contains 10 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Qualitative results for pruning methods and their inverses at varying pruning ratios (PR).
  • Figure 2: Qualitative results that display samples while varying PR. Notice that the quality does not degrade as PR increases, in agreement with the FID curve.
  • Figure 3: Qualitative results showing the clean samples using clustering-based pruning for PR=0.5.
  • Figure 4: Qualitative results showing the clean samples of MoSo and GraNd and their inverses.
  • Figure 5: Balanced sampling from clusters attempts to generate minority populations.
  • ...and 3 more figures