Interpreting and Improving Diffusion Models from an Optimization Perspective

Frank Permenter, Chenyang Yuan

TL;DR

This work reframes diffusion models through an optimization lens by treating denoising as approximate projection onto the data manifold and diffusion as gradient descent on the squared distance to that manifold. It introduces an $(\eta,\nu)$-approximate projection model and analyzes DDIM under this framework, yielding convergence guarantees and guiding principles for noise schedules. A gradient-estimation sampler is proposed to reduce gradient-estimation error by aggregating prior denoiser outputs, achieving state-of-the-art FID with few evaluations on CIFAR-10, CelebA, and latent-diffusion systems. The findings connect diffusion, projection, and distance-function concepts, enabling new deterministic samplers and conditioning strategies with practical impact for fast, high-quality image generation. Overall, the approach offers a cohesive, theory-driven path to design and analyze diffusion samplers that generalize beyond standard DDIM/DDPM implementations.

Abstract

Denoising is intuitively related to projection. Indeed, under the manifold hypothesis, adding random noise is approximately equivalent to orthogonal perturbation. Hence, learning to denoise is approximately learning to project. In this paper, we use this observation to interpret denoising diffusion models as approximate gradient descent applied to the Euclidean distance function. We then provide a straightforward convergence analysis of the DDIM sampler under simple assumptions on the projection error of the denoiser. Finally, we propose a new gradient-estimation sampler, generalizing DDIM using insights from our theoretical results. In as few as 5-10 function evaluations, our sampler achieves state-of-the-art FID scores on pretrained CIFAR-10 and CelebA models and can generate high-quality samples on latent diffusion models.
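To make the "diffusion as gradient descent on the distance function" reading concrete, here is a minimal sketch on a toy problem. It is not the authors' implementation. Everything specific in it is an assumption made for illustration: the data manifold is the unit circle in the plane, the denoiser is a hypothetical ideal one whose predicted noise is exactly $(x - {\rm proj}_{\mathcal{K}}(x))/\sigma$, the deterministic update is written as $x \leftarrow x + (\sigma_{\rm next} - \sigma)\,\epsilon(x, \sigma)$, and the noise schedule is geometric.

```python
import numpy as np

def project_to_circle(x):
    """Exact projection onto the unit circle (toy stand-in for the data manifold K)."""
    return x / np.linalg.norm(x)

def toy_noise_predictor(x, sigma):
    """Hypothetical ideal denoiser: predicted noise is (x - proj_K(x)) / sigma."""
    return (x - project_to_circle(x)) / sigma

def ddim_like_sampler(x_start, sigmas):
    """Deterministic update x <- x + (sigma_next - sigma) * eps(x, sigma).

    With the ideal predictor above, each step is a gradient step on
    0.5 * dist_K(x)^2 with step size 1 - sigma_next / sigma.
    """
    x = np.asarray(x_start, dtype=float)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (sigma_next - sigma) * toy_noise_predictor(x, sigma)
    return x

# Assumed geometric noise schedule with 10 steps (11 noise levels).
sigmas = np.geomspace(2.0, 0.01, num=11)

x0 = np.array([3.0, 4.0])  # starts at distance 4 from the unit circle
x_final = ddim_like_sampler(x0, sigmas)

dist_start = abs(np.linalg.norm(x0) - 1.0)
dist_final = abs(np.linalg.norm(x_final) - 1.0)
print(f"distance to circle: start={dist_start:.4f}, final={dist_final:.4f}")
```

In this idealized setting each step shrinks the distance to the circle by the factor $\sigma_{\rm next}/\sigma$, so the final distance equals the starting distance times the ratio of the last to the first noise level. A learned denoiser only approximates the projection, which is where the paper's error assumptions come in.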

Paper Structure

This paper contains 55 sections, 23 theorems, 75 equations, 11 figures, 4 tables, 2 algorithms.

Key Result

Proposition 2.1

Suppose $\mathcal{K} \subseteq \mathbb{R}^n$ is closed and $x \notin \mathcal{K}$. Then ${\rm proj}_{\mathcal{K}}(x)$ is unique for almost all $x \in \mathbb{R}^n$ (under the Lebesgue measure). If ${\rm proj}_{\mathcal{K}}(x)$ is unique, then $\nabla {\rm dist}_{\mathcal{K}}(x)$ exists, …
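The statement above is cut off in this summary, but the Figure 1 caption below relies on the related identity $\nabla \frac{1}{2}{\rm dist}_{\mathcal{K}}^2(x) = x - {\rm proj}_{\mathcal{K}}(x)$. The following sketch checks that identity numerically with central finite differences. The set $\mathcal{K}$ (50 random points in $\mathbb{R}^3$, which is finite and therefore closed), the query point, and the helper names are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.normal(size=(50, 3))  # toy closed set: 50 random points in R^3

def proj(x):
    """Nearest point of K to x (unique for almost every x)."""
    return K[np.argmin(np.linalg.norm(K - x, axis=1))]

def half_sq_dist(x):
    """0.5 * dist_K(x)^2 for the finite set K."""
    return 0.5 * np.min(np.sum((K - x) ** 2, axis=1))

def finite_diff_grad(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

x = 3.0 * rng.normal(size=3)  # arbitrary query point away from K
numeric = finite_diff_grad(half_sq_dist, x)
analytic = x - proj(x)
print("max abs difference:", np.max(np.abs(numeric - analytic)))
```

The check can fail only if the query point sits almost exactly on a boundary between two nearest-neighbour regions, where the projection is not unique. That is the measure-zero exception the proposition allows for.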

Figures (11)

  • Figure 1: Denoising approximates projection. When $\sigma$ is small (low-noise panel), most of the added noise lies in ${\rm tan}_{\mathcal{K}}(x_0)^\perp$ with high probability under the manifold hypothesis. When $\sigma$ is large (high-noise panel), both denoising and projection point in the same direction towards $\mathcal{K}$. We interpret the denoising process (denoising panel) as minimizing ${\rm dist}_{\mathcal{K}}^2(x)$ by iteratively taking gradient steps, estimating the direction of $\nabla \frac{1}{2}{\rm dist}_{\mathcal{K}}^2(x_t) = x_t - {\rm proj}_{\mathcal{K}}(x_t)$ with $\epsilon_\theta(x_t)$.
  • Figure 2: Ideal denoiser well-approximates projection onto the CIFAR-10 dataset. Dashed line plots error for the example shown, and density plot shows the error distribution over 10k different DDIM sampling trajectories.
  • Figure 3: Illustration of our choice of $\bar{\epsilon}_t$
  • Figure 4: Outputs of our gradient-estimation sampler on text-to-image Stable Diffusion compared to other commonly used samplers, when limited to $N = 10$ function evaluations. We also report FID scores for text-to-image generation on MS-COCO 30K.
  • Figure 5: Plot of different choices of $\log(\sigma_t)$ for $N=10$.
  • ...and 6 more figures
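
The Figure 1 claim that most added noise lies in the normal space can be illustrated in the simplest possible case. The sketch below assumes $\mathcal{K}$ is a $d$-dimensional coordinate subspace of $\mathbb{R}^n$, so the tangent space is the first $d$ coordinates and the normal space is the rest. The values $n = 3072$ (the dimension of a 32x32 RGB image), $d = 10$, and $\sigma = 0.1$ are illustrative assumptions, not numbers from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 3072       # ambient dimension (32 * 32 * 3)
d = 10         # assumed manifold dimension
sigma = 0.1    # assumed noise level
num_samples = 1000

# Isotropic Gaussian noise added to a point on the subspace.
noise = sigma * rng.normal(size=(num_samples, n))

# Split each noise vector's squared norm into tangent and total parts.
tangent_energy = np.sum(noise[:, :d] ** 2, axis=1)
total_energy = np.sum(noise ** 2, axis=1)
normal_fraction = 1.0 - tangent_energy / total_energy

print("mean fraction of noise energy in the normal space:", normal_fraction.mean())
print("expected value (n - d) / n:", (n - d) / n)
```

Because the noise is isotropic, the expected share of its squared norm in the normal space is $(n - d)/n$ regardless of $\sigma$. In this linear toy case, projecting the noisy point back onto $\mathcal{K}$ therefore removes almost all of the added noise when $d \ll n$.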

Theorems & Definitions (35)

  • Proposition 2.1: page 283, Theorem 3.3 of Delfour and Zolésio (2011)
  • Definition 3.1
  • Lemma 3.1: Theorem 4.8(12) in Federer (1959)
  • Proposition 3.1: Oracle denoising (informal)
  • Proposition 3.2
  • Proposition 3.3
  • Proposition 3.4
  • Theorem 4.1
  • Lemma 4.1
  • Definition 4.1
  • ...and 25 more