Slight Corruption in Pre-training Data Makes Better Diffusion Models

Hao Chen, Yujin Han, Diganta Misra, Xiang Li, Kai Hu, Difan Zou, Masashi Sugiyama, Jindong Wang, Bhiksha Raj

TL;DR

This paper synthetically corrupts ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs, and proposes a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP).

Abstract

Diffusion models (DMs) have shown remarkable capabilities in generating realistic, high-quality images, audio, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data that pairs data with conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets inevitably contain corrupted pairs whose conditions do not accurately describe the data. This paper presents the first comprehensive study on the impact of such corruption in the pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, during both pre-training and downstream adaptation. Theoretically, we consider a Gaussian mixture model and prove that slight corruption in the condition gives the data distribution generated by DMs trained on corrupted data a higher entropy and a reduced 2-Wasserstein distance to the ground truth. Inspired by our analysis, we propose a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP). CEP significantly improves the performance of various DMs in both pre-training and downstream tasks. We hope that our study provides new insights into the data and pre-training processes of DMs. All models are released at https://huggingface.co/DiffusionNoise.
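
The two ideas in the abstract, synthetic condition corruption and condition embedding perturbation, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names `corrupt_labels` and `perturb_condition_embedding` are hypothetical, the corruption is modeled as random label flipping at a fixed ratio, and the noise is Gaussian with scale `gamma / sqrt(d)`. The abstract does not specify the exact corruption types or noise distribution, so this should not be read as the authors' implementation.

```python
import math
import random


def corrupt_labels(labels, num_classes, ratio, rng):
    """Flip a `ratio` fraction of class labels to a different class.

    Assumption: corruption is modeled as uniform random label flipping.
    """
    labels = list(labels)
    num_corrupt = round(ratio * len(labels))
    for i in rng.sample(range(len(labels)), num_corrupt):
        # An offset in [1, num_classes - 1] guarantees the new label differs.
        labels[i] = (labels[i] + rng.randrange(1, num_classes)) % num_classes
    return labels


def perturb_condition_embedding(embedding, gamma, rng):
    """Add zero-mean Gaussian noise to one condition embedding vector.

    Assumption: the noise scale is gamma / sqrt(d), with d the embedding
    dimension; gamma = 0 leaves the embedding unchanged.
    """
    d = len(embedding)
    scale = gamma / math.sqrt(d)
    return [x + scale * rng.gauss(0.0, 1.0) for x in embedding]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [i % 10 for i in range(1000)]
    noisy = corrupt_labels(clean, num_classes=10, ratio=0.05, rng=rng)
    changed = sum(a != b for a, b in zip(clean, noisy))
    print(f"labels changed: {changed} of {len(clean)}")

    emb = [0.0] * 16
    print(perturb_condition_embedding(emb, gamma=1.0, rng=rng)[:4])
```

In this sketch the label corruption acts on the dataset before training, while the embedding perturbation would be applied to the condition embedding at each training step; which of the two a given experiment uses is left to the caller.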

Paper Structure (44 sections, 9 theorems, 92 equations, 33 figures, 7 tables)

Key Result

Theorem 1

For any class $k \in \mathcal{Y}$ and sufficiently large length $T$, assuming the norm of the corresponding expectation $\|\boldsymbol{\mu}_k\|^2_2$ is a constant and the empirical covariance of the training data is full rank, let $\mathbf{z}_T$ and $\mathbf{z}^c_T$ be the generations with clean and corrupted conditions, respectively. ... where $\gamma$ is the corruption control parameter and $d$ is the data dimension.
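
As background for the two quantities named in the abstract (entropy and 2-Wasserstein distance), the standard closed forms for Gaussian distributions are given here. These are textbook identities, not the paper's theorem statements; the symbols $\mathbf{m}_i$ and $\boldsymbol{\Sigma}_i$ are introduced only for this illustration.

$$
H\big(\mathcal{N}(\mathbf{m}, \boldsymbol{\Sigma})\big) = \frac{1}{2} \log \det\big(2 \pi e \, \boldsymbol{\Sigma}\big)
$$

$$
W_2^2\big(\mathcal{N}(\mathbf{m}_1, \boldsymbol{\Sigma}_1), \mathcal{N}(\mathbf{m}_2, \boldsymbol{\Sigma}_2)\big) = \|\mathbf{m}_1 - \mathbf{m}_2\|_2^2 + \operatorname{tr}\Big(\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2 - 2 \big(\boldsymbol{\Sigma}_2^{1/2} \boldsymbol{\Sigma}_1 \boldsymbol{\Sigma}_2^{1/2}\big)^{1/2}\Big)
$$

Under the first identity, a distribution with a larger covariance determinant has higher entropy, which is one way to read "higher entropy" as greater sample diversity.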

Figures (33)

  • Figure 1: Visualization from class and text-conditional DMs pre-trained with clean, slight, and severe condition corruption. Slight corruption in pre-training improves the quality and diversity of images.
  • Figure 2: (a) FID and (b) IS of DMs pre-trained on IN-1K and CC3M with various corruption. Slight corruption of various types helps DMs achieve better performance, compared to the clean ones.
  • Figure 3: Quantitative evaluation of generated images from class and text-conditional LDMs pre-trained with condition corruption. All metrics are computed over 50K generated images and validation images of IN-1K and MS-COCO. We plot FID vs. IS or CS ((a) and (c)), and Precision vs. Recall ((b) and (d)), where each point indicates the result computed using one guidance scale. Models pre-trained with slight condition corruption achieve better FID, IS or CS, and PR trade-off.
  • Figure 4: Quantitative evaluation of complexity and diversity of class and text-conditional LDMs. We plot the top-$1\%$ RMD score ((a) and (c)) which measures the complexity and diversity of samples (with $s=2.0$ and $s=3.0$ for IN-1K and CC3M LDMs), and the sample entropy ((b) and (d)) as a proxy measure of diversity, where each point indicates the result of a guidance scale. Models pre-trained with slight condition corruption generate samples of higher complexity and diversity.
  • Figure 5: Qualitative evaluation of images generated from circular walk around the learned latent space using (a) class-conditional IN-1K LDMs and (b) text-conditional CC3M LDMs. Models pre-trained with slight condition corruption present more diversity in the learned distribution.
  • ...and 28 more figures

Theorems & Definitions (16)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • ...and 6 more