Is Your Conditional Diffusion Model Actually Denoising?

Daniel Pfrommer, Zehao Dou, Christopher Scarvelis, Max Simchowitz, Ali Jadbabaie

TL;DR

The paper reveals that conditional diffusion models inherently exhibit non-denoising behavior, quantified by Schedule Deviation (SD), a measure of deviation from the model-consistent diffusion path. SD is calculable without access to the true score or training data and is strongly predictive of disagreements between samplers like DDPM and DDIM. Empirical results across multiple datasets show SD is pervasive and persists despite larger models or more data, while a theoretical framework attributes this to smoothing-based self-guidance across conditioning variables. Toy datasets and analytic results substantiate that self-guidance can cause interpolated flows to deviate from denoising, suggesting a fundamental bias in conditional diffusion that has implications for sampling, distillation, and the interpretation of diffusion-based methods.

Abstract

We study the inductive biases of diffusion models with a conditioning variable, which have seen widespread application as both text-conditioned generative image models and observation-conditioned continuous control policies. We observe that when these models are queried conditionally, their generations consistently deviate from the idealized "denoising" process upon which diffusion models are formulated, inducing disagreement between popular sampling algorithms (e.g., DDPM and DDIM). We introduce Schedule Deviation, a rigorous measure which captures the rate of deviation from a standard denoising process, and provide a methodology to compute it. Crucially, we demonstrate that the deviation from an idealized denoising process occurs irrespective of the model capacity or the amount of training data. We posit that this phenomenon occurs due to the difficulty of bridging distinct denoising flows across different parts of the conditioning space, and we show theoretically how such a phenomenon can arise through an inductive bias towards smoothness.
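As a point of reference for the idealized "denoising" behavior described above, the following is a minimal sketch (not taken from the paper) of the case where deterministic and stochastic samplers are expected to agree: a 1-D Gaussian data distribution whose score is known in closed form. The forward process `dx = -x dt + sqrt(2) dW`, the constants `M0` and `S0`, and the helpers `marginal`, `reverse_sample`, and `mean_std` are all assumptions made for illustration. The ODE and SDE integrators are continuous-time analogues of DDIM-style and DDPM-style sampling, not the exact discrete algorithms.

```python
import math
import random

M0, S0 = 2.0, 0.5  # hypothetical toy data distribution N(M0, S0^2)

def marginal(t):
    """Mean and variance of x_t = e^{-t} x_0 + sqrt(1 - e^{-2t}) * eps."""
    a = math.exp(-t)
    return a * M0, a * a * S0 * S0 + (1.0 - a * a)

def reverse_sample(stochastic, n=2000, T=5.0, steps=500, seed=0):
    """Integrate from t = T down to t = 0 using the exact Gaussian score."""
    rng = random.Random(seed)
    h = T / steps
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # p_T is close to N(0, 1)
    for k in range(steps):
        mu, var = marginal(T - k * h)
        out = []
        for x in xs:
            s = -(x - mu) / var  # exact score of the marginal p_t at x
            if stochastic:
                # Euler-Maruyama step of the reverse-time SDE
                x = x + (x + 2.0 * s) * h + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)
            else:
                # Euler step of the probability-flow ODE
                x = x + (x + s) * h
            out.append(x)
        xs = out
    return xs

def mean_std(xs):
    """Sample mean and (population) standard deviation."""
    m = sum(xs) / len(xs)
    return m, math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

if __name__ == "__main__":
    print("ODE (deterministic):", mean_std(reverse_sample(False)))
    print("SDE (stochastic):   ", mean_std(reverse_sample(True)))
```

With an exact score, both integrators should recover a mean near `M0` and a standard deviation near `S0`, up to discretization and sampling error. The abstract's claim concerns learned conditional models, where this agreement is reported to break down; the sketch only illustrates the ideal baseline.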

Paper Structure

This paper contains 33 sections, 21 theorems, 92 equations, 15 figures, 1 algorithm.

Key Result

Proposition 2.0

Adopt the setup of Definition 2.4 (Ideal Model-Consistent Flow). Then $v = v^{\textsc{imcf}} = \textrm{IMCF}(p_0)$ is given explicitly by any of the following identities, where $c_1(s) := \dot{\alpha}(s) - \frac{\dot{\sigma}(s)}{\sigma(s)}\alpha(s)$, $c_2(s) := \frac{\dot{\sigma}(s)}{\sigma(s)}$, $\gamma_1(s) := \frac{\dot{\alpha}(s)}{\alpha(s)} \sigma(s)^2 - \dot{\sigma}(s)\sigma(s)$, $\gamma_2(s) := \frac{\dot{\alpha}(s)}{\alpha(
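The coefficients that are fully defined in the statement above can be evaluated numerically for a concrete schedule. The sketch below is illustrative only: the linear schedule $\alpha(s) = 1 - s$, $\sigma(s) = s$ and the function name `coefficients` are assumptions, not something the source specifies, and $\gamma_2$ is omitted because its definition is cut off. The identities in which these coefficients appear are not reproduced here either.

```python
def coefficients(alpha, sigma, d_alpha, d_sigma, s):
    """Evaluate c_1(s), c_2(s), gamma_1(s) from their stated definitions."""
    a, sg = alpha(s), sigma(s)
    da, dsg = d_alpha(s), d_sigma(s)
    c1 = da - (dsg / sg) * a              # c_1 = alpha' - (sigma'/sigma) * alpha
    c2 = dsg / sg                         # c_2 = sigma'/sigma
    gamma1 = (da / a) * sg ** 2 - dsg * sg  # gamma_1 = (alpha'/alpha) sigma^2 - sigma' sigma
    return c1, c2, gamma1

# Hypothetical linear schedule (an assumption for illustration).
alpha = lambda s: 1.0 - s
sigma = lambda s: s
d_alpha = lambda s: -1.0
d_sigma = lambda s: 1.0

if __name__ == "__main__":
    c1, c2, g1 = coefficients(alpha, sigma, d_alpha, d_sigma, 0.5)
    print(c1, c2, g1)  # -2.0 2.0 -1.0
```

One consequence of the definitions alone is that $\gamma_1(s) = c_1(s)\,\sigma(s)^2 / \alpha(s)$, which the values at $s = 0.5$ satisfy: $-2 \cdot 0.25 / 0.5 = -1$.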

Figures (15)

  • Figure 1: We principally consider three datasets: conditional MNIST (LeCun et al., 1998) (left), conditional Fashion-MNIST (Xiao et al., 2017) (middle), and endpoint-conditional maze path generation (right). For MNIST and Fashion-MNIST, we condition on the t-SNE embedding of the images (pictured above) rather than on the class labels, as a proxy for text-embedding-conditioned image generation.
  • Figure 2: For conditioning values $z \sim \mathrm{Unif}(\mathcal{Z})$, we plot the Total Schedule Deviation (for $p_0$ sampled using DDPM) and the optimal transport distance between DDPM/DDIM samples (as measured by the $1$-Wasserstein, or Earth Mover's, distance), demonstrating that our proposed metric, Schedule Deviation, is indeed predictive of divergence between different samplers. In the experiments appendix we demonstrate that these trends hold across different choices of samplers and show additional experiments on attribute-conditional Celeb-A, where the conditioning space is more uniform.
  • Figure 3: For t-SNE-conditional MNIST generation, we evaluate the Schedule Deviation and the empirical $1$-Wasserstein distance between DDPM/DDIM samples, ablated over the training dataset size $N \in \{10000, 30000, 60000\}$. We note a strong structural similarity between the two metrics that appears related to the contours of the conditioning distribution and the conditional data distributions.
  • Figure 4: Analogous to the MNIST heatmaps, we show that Schedule Deviation is predictive of divergence between the DDPM/DDIM samplers for the trajectory (left) and Fashion-MNIST (right) datasets. Note that the structure of the maze (shown in Figure 1) can clearly be observed in the Schedule Deviation. We defer full ablations over the training data for both to the experiments appendix.
  • Figure 5: We visualize the test loss (left) and total schedule deviation (center left) for three different model capacities over the course of a training run. For the 13.3M parameter model, we show the effect of training dataset size on schedule deviation (center right), and, for the full dataset, the distribution of total schedule deviation across different classes (right). The median, 30th, and 70th percentile values are shown across the left three plots for sampled training batches and conditioning values.
  • ...and 10 more figures
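Several of the captions above compare samplers using the empirical 1-Wasserstein (Earth Mover's) distance. As a hedged illustration of that metric, and not the paper's evaluation code, the sketch below uses the fact that for two equally sized 1-D samples the optimal coupling simply matches sorted values. The function name `wasserstein1_1d` is hypothetical, and the paper's samples (images, trajectories) are high-dimensional, where a general optimal transport solver would be needed instead.

```python
def wasserstein1_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.

    In one dimension the optimal transport plan pairs the i-th smallest
    value of xs with the i-th smallest value of ys.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)

if __name__ == "__main__":
    a = [0.0, 1.0, 2.0]
    b = [3.0, 1.0, 2.0]  # same values as a, shifted by +1 after sorting
    print(wasserstein1_1d(a, b))  # 1.0
```

Shifting every point of a sample by a constant changes this distance by exactly the size of the shift, which makes it a convenient scale for reading how far apart two samplers' outputs are.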

Theorems & Definitions (45)

  • Definition 2.1: Probability Paths and Conditional Flows
  • Definition 2.2
  • Definition 2.3: Diffusion Probability Path
  • Definition 2.4: Ideal Model-Consistent Flow
  • Remark 2.1: Ideal Model-Consistent Flow vs Ground Truth Flow
  • Proposition 2.0
  • Definition 3.1: Schedule Deviation
  • Proposition 3.0
  • Theorem 1
  • Remark 3.1: Schedule Deviation v.s. Generation Fidelity
  • ...and 35 more