Diffusion Models With Learned Adaptive Noise

Subham Sekhar Sahoo, Aaron Gokaslan, Chris De Sa, Volodymyr Kuleshov

TL;DR

This paper asks whether the diffusion process in diffusion models can be learned from data to improve probabilistic modeling. It introduces MuLAN, a multivariate learned adaptive noise process that applies noise at different rates per pixel, conditioned on context, and uses an auxiliary latent variable to enable joint learning of the forward and reverse diffusion processes. By reframing the forward process as a learnable variational posterior, MuLAN tightens the ELBO and achieves state-of-the-art density estimation on CIFAR-10 and ImageNet-32 while reducing the number of training steps by about half. The work provides extensive ablations and analyses, showing that the learned noise schedule, especially its polynomial per-pixel form and auxiliary latent conditioning, is crucial for the performance gains, and that the approach is compatible with existing diffusion architectures without modifying the underlying denoising network.

Abstract

Diffusion models have gained traction as powerful algorithms for synthesizing high-quality images. Central to these algorithms is the diffusion process, a set of equations which maps data to noise in a way that can significantly affect performance. In this paper, we explore whether the diffusion process can be learned from data. Our work is grounded in Bayesian inference and seeks to improve log-likelihood estimation by casting the learned diffusion process as an approximate variational posterior that yields a tighter lower bound (ELBO) on the likelihood. A widely held assumption is that the ELBO is invariant to the noise process: our work dispels this assumption and proposes multivariate learned adaptive noise (MuLAN), a learned diffusion process that applies noise at different rates across an image. Specifically, our method relies on a multivariate noise schedule that is a function of the data to ensure that the ELBO is no longer invariant to the choice of the noise schedule as in previous works. Empirically, MuLAN sets a new state-of-the-art in density estimation on CIFAR-10 and ImageNet and reduces the number of training steps by 50%. We provide the code, along with a blog post and video tutorial on the project page: https://s-sahoo.com/MuLAN

Paper Structure (88 sections, 83 equations, 13 figures, 8 tables)

Figures (13)

  • Figure 1: (Left) Comparison of noise schedule properties: Multivariate Learned Adaptive Noise schedule (MuLAN) (ours) versus a typical scalar noise schedule. Unlike scalar noise schedules, MuLAN’s multivariate and input-adaptive properties improve likelihood. (Right) Likelihood in bits-per-dimension (BPD) on CIFAR-10 without data augmentation.
  • Figure 2: Ablating components of MuLAN on CIFAR-10 over 2.5M steps with batch size of 64.
  • Figure 3: Noise schedule visualizations for MuLAN on CIFAR-10. We plot the variance of $\bm{\nu}_\phi(\mathbf{z}, t)$ across different $\mathbf{z} \sim p_\theta(\mathbf{z})$, where each curve represents the SNR corresponding to an input dimension.
  • Figure 4: For $\mathbf{c} =$ "class labels" or $\mathbf{c} = \mathbf{x}_0$, the likelihood estimates are worse than those of VDM. For $\mathbf{c} = \mathbf{x}_0$, the VLB degrades with increasing $T$, whereas for VDM and MuLAN it improves with increasing $T$. This empirical observation is consistent with our mathematical insights earlier. Because these models consistently underperform VDM, in line with our initial conjectures, we do not train them beyond 300k iterations given the substantial computational cost involved.
  • Figure 5: (a) Imagine piloting a plane across a region with cyclones and strong winds, as shown in Fig. \ref{fig:intuitive-explanation}. Plotting a direct, straight-line course through these adverse weather conditions requires more fuel and effort due to increased resistance. By navigating around the cyclones and winds, however, the plane reaches its destination with less energy, even if the route is longer. This intuition translates into mathematical and physical terms. The plane’s trajectory is denoted by $\mathbf{r}(t) \in \mathbb{R}^n_{+}$, while the forces acting on it are represented by $\mathbf{f}(\mathbf{r}(t)) \in \mathbb{R}^n$. The work required to navigate is given by $\int_{0}^{1} \mathbf{f}(\mathbf{r}(t)) \cdot \frac{d}{dt}\mathbf{r}(t) \, dt$. Here, the work depends on the trajectory because $\mathbf{f}(\mathbf{r}(t))$ is not a conservative field. (b) This concept also applies to the diffusion NELBO. From Eq. \ref{eqn:dot_product}, it is clear that the trajectory $\mathbf{r}(t)$ is parameterized by the noise schedule $\bm{\nu}(\mathbf{z}, t)$, which is influenced by complex forces $\mathbf{f}$ (analogous to weather patterns), represented by the dimension-wise reconstruction error of the denoising model, $(\mathbf{x}_0 - \mathbf{x}_\theta(\mathbf{x}_t, \mathbf{z}, t))^2$. Thus, the diffusion loss, $\mathcal{L}_{\text{diffusion}}$, can be interpreted as the work done along the trajectory $\bm{\nu}(\mathbf{z}, t)$ in the presence of these vector field forces $\mathbf{f}$. By learning the noise schedule, we can avoid “high-resistance” paths (those where the loss accumulates rapidly), thereby minimizing the overall “energy” expended, as measured by the NELBO.
  • ...and 8 more figures
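
The path-dependence argument in the Figure 5 caption can be checked numerically. The sketch below is a toy illustration, not the paper's code: the two force fields, the two paths, and all function names are hypothetical choices made for this example. It approximates the work integral $\int_{0}^{1} \mathbf{f}(\mathbf{r}(t)) \cdot \frac{d}{dt}\mathbf{r}(t) \, dt$ along two paths that share the same endpoints, once for a non-conservative field and once for a conservative one.

```python
import numpy as np

def work_along_path(force, path, num_steps=10_000):
    """Approximate the line integral of force(r(t)) . dr/dt over t in [0, 1]."""
    t = np.linspace(0.0, 1.0, num_steps + 1)
    r = path(t)                     # points on the path, shape (num_steps + 1, n)
    dr = np.diff(r, axis=0)         # displacement of each small segment
    r_mid = 0.5 * (r[1:] + r[:-1])  # evaluate the force at segment midpoints
    return float(np.sum(force(r_mid) * dr))

# Two hypothetical paths from (0, 0) to (1, 1), both inside the positive orthant.
def straight(t):
    return np.stack([t, t], axis=-1)

def curved(t):
    return np.stack([t, t**2], axis=-1)

# Toy non-conservative field f(x, y) = (-y, x); its curl is nonzero.
def rotational_force(r):
    return np.stack([-r[:, 1], r[:, 0]], axis=-1)

# Toy conservative field f(x, y) = (x, y), the gradient of (x^2 + y^2) / 2.
def gradient_force(r):
    return r

for name, force in [("rotational", rotational_force), ("gradient", gradient_force)]:
    w_straight = work_along_path(force, straight)
    w_curved = work_along_path(force, curved)
    print(f"{name}: straight={w_straight:.4f}, curved={w_curved:.4f}")
```

For the rotational field the two paths give different work (analytically 0 along the straight line and 1/3 along the parabola), while for the gradient field both give 1. In the caption's analogy, the noise schedule plays the role of the path and the reconstruction error plays the role of the force, so a loss of this form can depend on the schedule only when the field is non-conservative.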