Slight Corruption in Pre-training Data Makes Better Diffusion Models

Hao Chen; Yujin Han; Diganta Misra; Xiang Li; Kai Hu; Difan Zou; Masashi Sugiyama; Jindong Wang; Bhiksha Raj

TL;DR

This paper synthetically corrupts ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs, and proposes a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP).
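The synthetic corruption mentioned above can be illustrated for the class-conditional case. This is a minimal sketch under an assumption not stated here: that corruption means replacing a fixed fraction of class labels with a different, uniformly chosen class. The function name `corrupt_labels` and the parameters `ratio` and `seed` are hypothetical, and the paper's actual protocol may differ.

```python
import random


def corrupt_labels(labels, num_classes, ratio, seed=0):
    """Return a copy of `labels` with a fraction `ratio` of entries flipped.

    Assumption for illustration: each selected label is replaced by a
    different class drawn uniformly from the remaining classes.
    """
    rng = random.Random(seed)
    n = len(labels)
    num_corrupt = round(ratio * n)
    # Pick distinct positions to corrupt.
    positions = rng.sample(range(n), num_corrupt)
    corrupted = list(labels)
    for i in positions:
        # A nonzero offset modulo num_classes guarantees a different class.
        offset = rng.randrange(1, num_classes)
        corrupted[i] = (labels[i] + offset) % num_classes
    return corrupted


labels = [i % 10 for i in range(1000)]
noisy = corrupt_labels(labels, num_classes=10, ratio=0.05)
changed = sum(a != b for a, b in zip(labels, noisy))
print(changed)  # 50
```

A small `ratio` corresponds to the "slight" regime discussed here; larger values would model severe corruption.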

Abstract

Diffusion models (DMs) have shown remarkable capabilities in generating realistic, high-quality images, audio, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired data and conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets inevitably contain corrupted pairs in which the condition does not accurately describe the data. This paper presents the first comprehensive study of the impact of such corruption in the pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and in downstream adaptation. Theoretically, we consider a Gaussian mixture model and prove that slight corruption in the condition leads to higher entropy and a reduced 2-Wasserstein distance to the ground truth for the data distribution generated by the corruptly trained DMs. Inspired by our analysis, we propose a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP). CEP significantly improves the performance of various DMs in both pre-training and downstream tasks. We hope that our study provides new insights into the data and pre-training processes of DMs. All models are released at https://huggingface.co/DiffusionNoise.
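The idea of condition embedding perturbation can be sketched as adding small random noise to the condition embedding before it is passed to the denoiser during training. The following is an illustrative sketch only, not the authors' implementation: the function name `perturb_condition_embedding`, the `kind` options, and the `gamma / sqrt(d)` scaling are assumptions made for this example.

```python
import math
import random


def perturb_condition_embedding(embedding, gamma, rng, kind="gaussian"):
    """Return `embedding` plus small random noise (hypothetical CEP sketch).

    Assumption: the noise is scaled by gamma / sqrt(d), where d is the
    embedding dimension, so the perturbation norm stays roughly
    independent of d.
    """
    d = len(embedding)
    scale = gamma / math.sqrt(d)
    if kind == "gaussian":
        noise = [rng.gauss(0.0, 1.0) for _ in range(d)]
    elif kind == "uniform":
        noise = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    else:
        raise ValueError(f"unknown noise kind: {kind}")
    return [e + scale * n for e, n in zip(embedding, noise)]


rng = random.Random(0)
clean = [0.5] * 16  # stand-in for a class or text embedding
perturbed = perturb_condition_embedding(clean, gamma=1.0, rng=rng, kind="uniform")
max_shift = max(abs(p - c) for p, c in zip(perturbed, clean))
```

In a training loop, a perturbation like this would be applied only at training time, with the clean embedding used at sampling time. Setting `gamma` to zero recovers standard conditional training.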

Paper Structure (44 sections, 9 theorems, 92 equations, 33 figures, 7 tables)

Introduction
Preliminary
Understanding the Pre-training Corruption in Diffusion Models
Pre-training Evaluation
Downstream Personalization Evaluation
Discussion: Other Types of Pre-training Corruption and Diffusion Models
Theoretical Analysis
Generation Diversity: Clean vs. Corrupted Conditions
Generation Quality: Clean vs. Corrupted Conditions
Improving Diffusion Models with Conditional Embedding Perturbation
Method
Experiments
Related Work
Conclusion and Limitation
Derivations and Proofs
...and 29 more sections

Key Result

Theorem 1

For any class $k \in \mathcal{Y}$ and sufficiently large length $T$, assuming that the norm of the corresponding expectation $\|\boldsymbol{\mu}_k\|_2^2$ is a constant and the empirical covariance of the training data is full rank, let $\mathbf{z}_T$ and $\mathbf{z}^c_T$ be the generations with clean and corrupted conditions, respectively, [...] where $\gamma$ is the corruption control parameter and $d$ is the data dimension.

Figures (33)

Figure 1: Visualization from class and text-conditional DMs pre-trained with clean, slight, and severe condition corruption. Slight corruption in pre-training improves the quality and diversity of images.
Figure 2: (a) FID and (b) IS of DMs pre-trained on IN-1K and CC3M with various corruption. Slight corruption of various types helps DMs achieve better performance, compared to the clean ones.
Figure 3: Quantitative evaluation of generated images from class and text-conditional LDMs pre-trained with condition corruption. All metrics are computed over $50K$ generated images and validation images of IN-1K and MS-COCO. We plot FID vs. IS or CS ((a) and (c)) , and Precision vs. Recall ((b) and (d)), where each point indicates the results computed from using a guidance scale. Models pre-trained with slight condition corruption achieve better FID, IS or CS, and PR trade-off.
Figure 4: Quantitative evaluation of complexity and diversity of class and text-conditional LDMs. We plot the top-$1\%$ RMD score ((a) and (c)) which measures the complexity and diversity of samples (with $s=2.0$ and $s=3.0$ for IN-1K and CC3M LDMs), and the sample entropy ((b) and (d)) as a proxy measure of diversity, where each point indicates the result of a guidance scale. Models pre-trained with slight condition corruption generate samples of higher complexity and diversity.
Figure 5: Qualitative evaluation of images generated from circular walk around the learned latent space using (a) class-conditional IN-1K LDMs and (b) text-conditional CC3M LDMs. Models pre-trained with slight condition corruption present more diversity in the learned distribution.
...and 28 more figures

Theorems & Definitions (16)

Theorem 1
Theorem 2
Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
Lemma 4
proof
...and 6 more
