Is Noise Conditioning Necessary for Denoising Generative Models?

Qiao Sun, Zhicheng Jiang, Hanhong Zhao, Kaiming He

TL;DR

The paper challenges the widely held belief that noise conditioning is essential for denoising diffusion and related generative models. By reformulating training and sampling and analyzing effective targets, posterior concentration, and sampling error, the authors show that many models tolerate, or even benefit from, removing noise conditioning, and they derive a theoretical error bound that predicts this robustness. They introduce a noise-unconditional EDM variant (uEDM) that achieves a competitive FID of 2.23 on CIFAR-10, narrowing the gap to noise-conditional baselines. Across experiments on CIFAR-10, ImageNet, and FFHQ, the work shows that noise conditioning is not a prerequisite for these models to function, which may inspire new unconditional modeling approaches and sampling strategies. The findings offer practical guidance for model design and suggest avenues for integrating physics-based Langevin dynamics and alternative training objectives without time conditioning.

Abstract

It is widely believed that noise conditioning is indispensable for denoising diffusion models to work successfully. This work challenges this belief. Motivated by research on blind image denoising, we investigate a variety of denoising-based generative models in the absence of noise conditioning. To our surprise, most models exhibit graceful degradation, and in some cases, they even perform better without noise conditioning. We provide a theoretical analysis of the error caused by removing noise conditioning and demonstrate that our analysis aligns with empirical observations. We further introduce a noise-unconditional model that achieves a competitive FID of 2.23 on CIFAR-10, significantly narrowing the gap to leading noise-conditional models. We hope our findings will inspire the community to revisit the foundations and formulations of denoising generative models.

Paper Structure

This paper contains 70 sections, 5 theorems, 119 equations, 13 figures, 9 tables.

Key Result

Theorem 2

The original regression loss function with $t$ conditioning shown in eq:gs_loss, with $w(t)=1$, is equivalent to the loss function with the effective target shown in eq:eff_loss_wt, up to a constant term that is independent of ${\bm{\theta}}$. Here, $p({\mathbf{z}})$ is the marginalized distribution of ${\mathbf{z}}{:=}a(t){\mathbf{x}} + b(t){\bm{\epsilon}}$ in eq:z_cal, under the joint dist
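
The "up to a constant" claim follows the usual pattern for squared-error regression, which can be sketched as below. This is a generic identity, not the paper's exact statement: the symbol $c$ (standing for whatever the network is conditioned on, e.g. $({\mathbf{z}}, t)$ or ${\mathbf{z}}$ alone) and the shorthand $R(c)$ for the conditional mean of the regression target $r$ are assumptions introduced here for illustration.

```latex
% Assumed notation: c is the network input, r the per-sample regression target,
% and R(c) := E[r | c] the conditional mean of r given c.
\begin{aligned}
\mathbb{E}\,\big\| \texttt{NN}_{\bm{\theta}}(c) - r \big\|^2
  &= \mathbb{E}\,\big\| \texttt{NN}_{\bm{\theta}}(c) - R(c) \big\|^2
   + \underbrace{\mathbb{E}\,\big\| R(c) - r \big\|^2}_{\text{independent of } \bm{\theta}} \\
  &\quad + 2\,\underbrace{\mathbb{E}\,\big\langle \texttt{NN}_{\bm{\theta}}(c) - R(c),\; R(c) - r \big\rangle}_{=\,0 \text{ since } \mathbb{E}[\,r \mid c\,] = R(c)} .
\end{aligned}
```

Under these assumptions, minimizing the original loss over ${\bm{\theta}}$ is the same as regressing onto the conditional mean, which matches the description of the effective target as an expectation over all possible $r$.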

Figures (13)

  • Figure 1: (a) A denoising generative model takes noisy data ${\mathbf{z}}$ and a noise level indexed by $t$ (such as $\sigma_t$) as inputs to the neural network ${\texttt{NN}}_{{\bm{\theta}}}$. (b) This work investigates the scenario of removing noise conditioning from the network.
  • Figure 2: Illustration of the effective target $R({\mathbf{z}})$. A given ${\mathbf{z}}$ corresponds to multiple triplets $({\mathbf{x}}, {\bm{\epsilon}}, t)$. Here, we take Flow Matching (Lipman et al., 2023) as an example. On the left are samples of ${\bm{\epsilon}}$, and on the right are samples of ${\mathbf{x}}$. A noisy sample ${\mathbf{z}} = (1-t){\mathbf{x}} + t{\bm{\epsilon}}$ can be produced by different triplets, and each triplet gives a different regression target $r$. The effective target $R({\mathbf{z}})$ is the expectation over all possible $r$.
  • Figure 3: The posterior distribution $p(t|{\mathbf{z}})$ is concentrated. We pick ${\mathbf{z}} = (1-t_*){\mathbf{x}} + t_*{\bm{\epsilon}}$ with $t_*$ from 0.1 to 0.9 for illustration. This plot is empirically simulated from 15,000 images in the AFHQ-v2 dataset at a resolution of $64{\times}64$ (see app:numerical).
  • Figure 4: Error bound and the influence of noise conditioning. ODE with $N=100$ steps is applied for each variant. The plot shows the per-step error bound $A_i B_i$ in eq:bound, and the table shows the accumulated error bound. The y-axis is log-scale.
  • Figure 5: Samples of noise-conditional vs. noise-unconditional models. Samples are generated by (a) DDIM, (b) EDM, (c) FM (1-RF), and (d) uEDM, on the CIFAR-10 class-unconditional case. For each subfigure, the left panel is the noise-conditional case, and the right panel is the noise-unconditional counterpart, with the same random seeds. The change of FID is from "w/ $t$" to "w/o $t$". See also tab:exp for more quantitative results.
  • ...and 8 more figures
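
The posterior concentration described in the Figure 3 caption can be reproduced in miniature. The following is a minimal sketch, not the authors' experiment: it swaps AFHQ-v2 for a toy dataset of 64 random 256-dimensional vectors, assumes a uniform prior over a grid of $t$ values, and uses the Flow Matching interpolation ${\mathbf{z}} = (1-t){\mathbf{x}} + t{\bm{\epsilon}}$ from the Figure 2 caption. The function name `posterior_over_t` and all variable names are hypothetical.

```python
import numpy as np


def posterior_over_t(z, data, t_grid):
    """Posterior p(t | z) on a grid, assuming a uniform prior over t_grid.

    Assumes z = (1 - t) * x + t * eps with eps ~ N(0, I) and x drawn
    uniformly from the rows of `data` (a toy stand-in for a real dataset).
    """
    d = z.shape[0]
    c = 1.0 - t_grid  # coefficient on x, shape (T,)
    # Squared distances ||z - c * x_i||^2 for every (t, i) pair, shape (T, N).
    sq = (z @ z) - 2.0 * np.outer(c, data @ z) + np.outer(c**2, np.sum(data**2, axis=1))
    # Log of N(z; (1 - t) x_i, t^2 I), dropping constants shared by all t.
    log_lik = -sq / (2.0 * t_grid[:, None] ** 2) - d * np.log(t_grid)[:, None]
    # log p(z | t) via a stable log-sum-exp over data points
    # (the 1/N mixture weight is constant in t and cancels on normalization).
    m = log_lik.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.exp(log_lik - m).sum(axis=1))
    post = np.exp(log_p - log_p.max())
    return post / post.sum()


rng = np.random.default_rng(0)
data = rng.uniform(-1.0, 1.0, size=(64, 256))  # toy data, not AFHQ-v2
t_grid = np.linspace(0.02, 0.98, 481)

t_star = 0.5
z = (1.0 - t_star) * data[0] + t_star * rng.standard_normal(256)

post = posterior_over_t(z, data, t_grid)
post_mean = float(post @ t_grid)
post_std = float(np.sqrt(post @ (t_grid - post_mean) ** 2))
print(f"t* = {t_star}, posterior mean = {post_mean:.3f}, posterior std = {post_std:.3f}")
```

In this toy setting the posterior mass sits in a narrow band around $t_*$, and the band narrows as the dimension grows, which is one intuition for why a network can infer the noise level from ${\mathbf{z}}$ alone.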

Theorems & Definitions (11)

  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Theorem 4
  • proof
  • Lemma 1
  • proof
  • Theorem 5
  • proof
  • ...and 1 more