The Emergence of Reproducibility and Generalizability in Diffusion Models

Huijie Zhang; Jinfan Zhou; Yifu Lu; Minzhe Guo; Peng Wang; Liyue Shen; Qing Qu

TL;DR

The paper investigates a phenomenon called consistent model reproducibility in diffusion models, showing that different models trained on the same data with a deterministic sampler produce highly similar outputs from identical noise inputs. It introduces RP and GL metrics to quantify reproducibility and generalizability, and demonstrates that reproducibility arises in two regimes: memorization (learning the empirical data distribution) and generalization (learning the underlying distribution). The authors validate this across unconditional, conditional, inverse-problem, and fine-tuned diffusion models, and contrast with GANs/VAEs where reproducibility is rare. They discuss practical implications for training efficiency, privacy, and controllable data generation, and outline theoretical questions about how diffusion models map noise to data distributions. Key ideas are anchored in score-function learning and distribution learning, supported by toy MoG analyses and experiments with pre-trained diffusion models.
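To make the two metrics concrete, here is a minimal sketch of how reproducibility-style and generalizability-style scores could be computed from feature descriptors of generated and training images. The summary above does not give the exact definitions, so the cosine similarity, the `threshold` value, and the function names `rp_score` and `gl_score` are illustrative assumptions rather than the authors' formulas.

```python
import numpy as np


def _unit_rows(x):
    """Scale each row of x to unit Euclidean norm."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def rp_score(feats_model_a, feats_model_b, threshold=0.6):
    """Fraction of shared noise inputs on which two models agree.

    Row i of each array is the descriptor of the sample that a model
    generated from the same i-th initial noise, so only matching rows
    are compared. `threshold` is a hypothetical cut-off.
    """
    sims = np.sum(_unit_rows(feats_model_a) * _unit_rows(feats_model_b), axis=1)
    return float(np.mean(sims > threshold))


def gl_score(feats_generated, feats_train, threshold=0.6):
    """Fraction of generated samples with no close match in the training set."""
    sims = _unit_rows(feats_generated) @ _unit_rows(feats_train).T
    return float(np.mean(sims.max(axis=1) <= threshold))
```

Under these assumptions, a high `rp_score` together with a low `gl_score` would correspond to the memorization regime, while both scores being high would correspond to the generalization regime.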

Abstract

In this work, we investigate an intriguing and prevalent phenomenon of diffusion models which we term "consistent model reproducibility": given the same starting noise input and a deterministic sampler, different diffusion models often yield remarkably similar outputs. We confirm this phenomenon through comprehensive experiments, implying that different diffusion models consistently reach the same data distribution and score function regardless of diffusion model frameworks, model architectures, or training procedures. More strikingly, our further investigation implies that diffusion models learn distinct distributions depending on the training data size. This is supported by the fact that model reproducibility manifests in two distinct training regimes: (i) the "memorization regime", where the diffusion model overfits to the training data distribution, and (ii) the "generalization regime", where the model learns the underlying data distribution. Our study also finds that this valuable property generalizes to many variants of diffusion models, including those for conditional generation, solving inverse problems, and model fine-tuning. Finally, our work raises numerous intriguing theoretical questions for future investigation and highlights practical implications regarding training efficiency, model privacy, and the controlled generation of diffusion models.
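The role of the deterministic sampler can be illustrated with a toy sketch: once the score function is fixed, an ODE sampler is a deterministic map from noise to data, so two models with nearly identical scores send the same noise to nearly the same output. The code below is not the paper's experimental setup. It assumes a one-dimensional Gaussian data distribution with an analytic score, a variance-exploding parameterization with noise level `sigma`, and plain Euler integration; all names and parameter values are hypothetical.

```python
from functools import partial

import numpy as np


def gaussian_score(x, sigma, mean, std):
    """Exact score of N(mean, std^2) data perturbed with noise level sigma."""
    return -(x - mean) / (std**2 + sigma**2)


def ode_sample(noise, score_fn, sigma_max=10.0, steps=1000):
    """Euler integration of dx/dsigma = -sigma * score(x, sigma), sigma_max -> 0."""
    sigmas = np.linspace(sigma_max, 0.0, steps + 1)
    x = sigma_max * np.asarray(noise, dtype=float)  # approximate sample at sigma_max
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (s_next - s_cur) * (-s_cur * score_fn(x, s_cur))
    return x


rng = np.random.default_rng(0)
noise = rng.standard_normal(5)

# Two stand-in "models": analytic scores that differ slightly, mimicking
# independently trained networks that learned almost the same distribution.
model_a = partial(gaussian_score, mean=2.0, std=0.5)
model_b = partial(gaussian_score, mean=2.02, std=0.5)

out_a = ode_sample(noise, model_a)
out_b = ode_sample(noise, model_b)
max_gap = float(np.max(np.abs(out_a - out_b)))  # small when the scores are close
```

In this sketch the gap between the two outputs is controlled entirely by the gap between the two score functions, which is the intuition behind linking reproducibility to learning the same distribution.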

Paper Structure (46 sections, 5 theorems, 31 equations, 23 figures, 1 table, 1 algorithm)

Introduction
Summary of contributions.
Theoretical and practical implications of our work.
Notations.
Organization of the paper.
Consistent Model Reproducibility
Measures of Reproducibility and Generalizability
Measure of model reproducibility.
Measure of model generalizability.
Model Reproducibility Manifests in Two Regimes
Reproducibility is Rare in Generative Models
Analyzing Reproducibility in Two Regimes
Reproducibility in Memorization Regime
Reproducibility in Generalization Regime
Reproducibility & Distribution Learning
...and 31 more sections

Key Result

Lemma 1

Suppose the distribution learned by the diffusion model is $p(\bm x_0)$ and the perturbation kernel is $p_{t}(\bm x_t|\bm x_0) = \mathcal{N}(\bm x_t;s_t\bm x_0, s_t^2\sigma_t^2\mathbf{I})$ with perturbation parameters $s_t, \sigma_t$. The ideal score function has the following form
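The statement above breaks off before giving the formula. As a hedged sketch, the standard identity that follows from the stated Gaussian kernel is shown below; it is derived here from the kernel alone and may differ in presentation from the form the authors write. The posterior mean $\mathbb{E}[\bm x_0 \mid \bm x_t]$ is notation introduced for this sketch.

```latex
\begin{align*}
p_t(\bm x_t) &= \int p_t(\bm x_t \mid \bm x_0)\, p(\bm x_0)\, d\bm x_0, \\
\nabla_{\bm x_t} \log p_t(\bm x_t)
  &= \frac{1}{p_t(\bm x_t)} \int \nabla_{\bm x_t} p_t(\bm x_t \mid \bm x_0)\, p(\bm x_0)\, d\bm x_0 \\
  &= \int -\frac{\bm x_t - s_t \bm x_0}{s_t^2 \sigma_t^2}\, p(\bm x_0 \mid \bm x_t)\, d\bm x_0 \\
  &= -\frac{1}{s_t^2 \sigma_t^2} \left( \bm x_t - s_t\, \mathbb{E}[\bm x_0 \mid \bm x_t] \right).
\end{align*}
```

The third line uses Bayes' rule, $p_t(\bm x_t \mid \bm x_0)\, p(\bm x_0) / p_t(\bm x_t) = p(\bm x_0 \mid \bm x_t)$, together with the gradient of the Gaussian kernel.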

Figures (23)

Figure 1: Visualization of generated samples from different diffusion models. We use denoising diffusion probabilistic models (DDPM) [ho2020denoising, song2020denoising], a consistency model (CT) [song2023consistency], and U-ViT [bao2023all], all trained on the CIFAR-10 dataset [krizhevsky2009learning]. Samples in the corresponding row and column are generated from the same initial noise with a deterministic ODE sampler.
Figure 2: "Memorization" and "Generalization" regimes for unconditional diffusion models. We train DDPMv4 models on the CIFAR-10 dataset, varying both the model size and the size of the training dataset. For model size, we experiment with UNet-64, UNet-128, and UNet-256, where, for instance, UNet-64 indicates a UNet structure with an embedding dimension of 64. For dataset size, we select subsets of the CIFAR-10 dataset ranging from $2^6$ to $2^{15}$ images. For each dataset size, the different models are trained on the same subset of images. The left figure shows the reproducibility score between models at each dataset size, while the right figure shows the generalizability score of the models as the dataset size changes.
Figure 3: Quantitative results for GANs and VAEs. For GAN-based methods, we evaluate two architectures, Wasserstein GAN (wGAN) [arjovsky2017wasserstein] and Spectral Normalization GAN (SNGAN) [miyato2018spectral], trained on the CIFAR-10 dataset. For VAE-based approaches, we consider both the standard VAE and the Variational Autoencoding Mutual Information Bottleneck (VAMP) model [tomczak2018vae], trained on the MNIST dataset [lecun1998gradient].
Figure 4: Convergence of the optimal denoiser (left) and training loss (right) w.r.t. the training data size. We train DDPMv4 on the CIFAR-10 dataset, varying both the model's capacity and the size of the training dataset, with the same configuration as in \ref{fig:two_regime}. The left figure shows the reproducibility score between each diffusion model and the theoretically unique identifiable encoding outlined in \ref{proposition:empirical distribution}; the right figure shows the training loss of these models when trained until convergence.
Figure 5: Score matching accuracy. We train the same diffusion model with varying numbers of training samples $N$ and subspace dimension $r$ from the mixture of Gaussians distribution defined in \ref{eqn:mlg} and plot the metric $\mathcal{L}_{\text{score}}$ in different colors for each $r$. The detailed experimental settings are in \ref{append:MoG}.
...and 18 more figures

Theorems & Definitions (8)

Lemma 1
Proposition 1
Proposition 2
Proposition 3.2
proof
proof
Proposition 3.3
proof
