Diffusion Models Generate Images Like Painters: an Analytical Theory of Outline First, Details Later

Binxu Wang, John J. Vastola

TL;DR

This work analyzes how diffusion models convert noise into structured images, showing that reverse-diffusion trajectories are effectively 2D rotations toward an evolving endpoint on the image manifold and that high-variance features emerge early while low-variance details accumulate later. It derives an exact analytic solution for the Gaussian-score model's probability-flow ODE, validates it on pretrained models, and demonstrates practical speedups by teleporting through early steps. Extending beyond Gaussian scores, the authors show that diffusion trajectories resemble retrieval in a general point-cloud setting, with endpoint estimates remaining on the data manifold and converging toward nearby training samples. The results offer a unified view linking diffusion dynamics, GAN-like outline-first generation, and manifold geometry, with concrete methods for accelerating sampling and mapping the image-manifold in complex models such as Stable Diffusion.

Abstract

How do diffusion generative models convert pure noise into meaningful images? In a variety of pretrained diffusion models (including conditional latent space models like Stable Diffusion), we observe that the reverse diffusion process that underlies image generation has the following properties: (i) individual trajectories tend to be low-dimensional and resemble 2D 'rotations'; (ii) high-variance scene features like layout tend to emerge earlier, while low-variance details tend to emerge later; and (iii) early perturbations tend to have a greater impact on image content than later perturbations. To understand these phenomena, we derive and study a closed-form solution to the probability flow ODE for a Gaussian distribution, which shows that the reverse diffusion state rotates towards a gradually-specified target on the image manifold. It also shows that generation involves first committing to an outline, and then to finer and finer details. We find that this solution accurately describes the initial phase of image generation for pretrained models, and can in principle be used to make image generation more efficient by skipping reverse diffusion steps. Finally, we use our solution to characterize the image manifold in Stable Diffusion. Our viewpoint reveals an unexpected similarity between generation by GANs and diffusion and provides a conceptual link between diffusion and image retrieval.
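The closed-form picture in the abstract can be illustrated with a small NumPy sketch. It is not the authors' code: it assumes a variance-preserving forward process $\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}$ with a toy cosine schedule, data distributed as $\mathcal{N}(\boldsymbol{\mu}, U \mathrm{diag}(\lambda) U^\top)$ with a full orthonormal basis $U$ and strictly positive $\lambda$, and the function names `alpha_sigma`, `gaussian_pf_ode_state`, `endpoint_estimate`, and `endpoint_progress` are hypothetical. Under those assumptions the probability flow ODE acts independently on each principal axis, scaling the coefficient by the ratio of marginal standard deviations.

```python
import numpy as np

def alpha_sigma(t):
    """Toy variance-preserving schedule (an assumption): alpha^2 + sigma^2 = 1."""
    alpha = np.cos(0.5 * np.pi * t)
    return alpha, np.sqrt(1.0 - alpha**2)

def gaussian_pf_ode_state(x_T, mu, U, lam, t, T=1.0):
    """State x_t on the probability flow ODE for data N(mu, U diag(lam) U^T)."""
    a_t, s_t = alpha_sigma(t)
    a_T, s_T = alpha_sigma(T)
    # Each eigen-direction is rescaled by the ratio of marginal std at t and at T.
    psi = np.sqrt((s_t**2 + lam * a_t**2) / (s_T**2 + lam * a_T**2))
    c_T = U.T @ (x_T - a_T * mu)  # coefficients of the initial noise
    return a_t * mu + U @ (psi * c_T)

def endpoint_estimate(x_t, mu, U, lam, t):
    """Denoised estimate E[x_0 | x_t] for the same Gaussian data model."""
    a_t, s_t = alpha_sigma(t)
    c_t = U.T @ (x_t - a_t * mu)
    gain = a_t * lam / (s_t**2 + lam * a_t**2)  # per-direction shrinkage
    return mu + U @ (gain * c_t)

def endpoint_progress(lam, t):
    """Endpoint-estimate coefficient at time t divided by its value at t = 0.

    Assumes alpha_0 = 1 and sigma_0 = 0, as in the toy schedule above.
    """
    a_t, s_t = alpha_sigma(t)
    return a_t * np.sqrt(lam / (s_t**2 + lam * a_t**2))

# Three variances spanning "layout" (large) to "detail" (small).
lam = np.array([100.0, 1.0, 0.01])
for t in (0.9, 0.5, 0.1):  # reverse diffusion runs from t = 1 down to t = 0
    print(t, np.round(endpoint_progress(lam, t), 3))
```

In this toy setting the high-variance direction is mostly settled while $t$ is still close to 1, and the low-variance direction only catches up near $t = 0$, which mirrors the outline-first ordering described above. Evaluating `gaussian_pf_ode_state` at an intermediate $t$ is also the kind of jump that could stand in for the early sampler steps, under the stated Gaussian assumption.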

Paper Structure (62 sections, 88 equations, 28 figures, 6 tables)


Figures (28)

  • Figure 1: Characteristics of image generation by diffusion models. A. Tracking latent states $G(\mathbf{x}_t)$ (top row), differences between nearby time steps $G(k (\mathbf{x}_{t-1} - \mathbf{x}_{t}))$ (middle row), and final image estimates $G(\hat{\mathbf{x}}_0(\mathbf{x}_t))$ (bottom row) suggests different measures of progress. B. Individual trajectories are effectively two-dimensional, with the transition from $\mathbf{x}_T$ to $\mathbf{x}_0$ being rotation-like.
  • Figure 2: Geometry of single mode reverse diffusion.
  • Figure 3: Analytical solution to diffusion dynamics in the Gaussian case. A. $\psi(t,\lambda)$ governs the dynamics of the state $\mathbf{x}_t$ along each principal axis $\mathbf{u}_k$. B. $\xi(t,\lambda)$ governs the dynamics of the endpoint estimate $\hat{\mathbf{x}}_0(\mathbf{x}_t)$ along each PC, normalized by the standard deviation $\sqrt{\lambda_k}$. C. Time derivative of $\xi(t,\lambda)/\sqrt{\lambda}$, highlighting the 'critical period' when the feature develops. D. $\sqrt{\lambda/(\sigma_{t'}^2+\lambda\alpha_{t'}^2)}$, which quantifies the amplification effect of a perturbation along PC $\mathbf{u}_k$ at time $t'$ (Eq. \ref{eq:y_perturb_formula}). We used the $\alpha_t$ schedule from ddpm-CIFAR-10.
  • Figure 4: Comparing the analytical solution to DDIM sampling for a CIFAR-10 diffusion model. A. $\hat{\mathbf{x}}_0(\mathbf{x}_t)$ of a DDIM trajectory and the Gaussian solution with the same initial condition $\mathbf{x}_T$. B. Samples generated by DDIM and the analytical theories from the same initial condition. C. Mean squared error between the $\mathbf{x}_t$ trajectory of DDIM and the Gaussian solution. D. Comparing the state trajectory and final sample of three normative models (Gaussian, GMM, exact) with DDIM. E. Hybrid sampling method combines Gaussian theory prediction with DDIM. F. Image quality of the hybrid method (FID score) as a function of the number of skipped steps for the EDM model and sampler [karras2022elucidatingDesignSp] (see Appendix \ref{apd:fid_method}).
  • Figure 5: Stable Diffusion: Local manifold map. A. PCs of the projected outcome trajectory $\hat{\mathbf{x}}_0(\mathbf{x}_t)$ are on-manifold. B. Trajectory of the endpoint estimate $G(\hat{\mathbf{x}}_0(\mathbf{x}_t))$. C. Perturbation by PC2 and PC3; notice an apple morphing into a teddy bear. D. Perturbing the trajectory along PC2 or PC3 during reverse diffusion. Rows: different perturbation times. Columns: different magnitudes.
  • ...and 23 more figures
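
The low-dimensionality claim in the Figure 1B caption can be checked on any stored sampler trajectory. The sketch below is one possible check, not necessarily the authors' procedure: it assumes the trajectory is stacked as a `(steps, dim)` array and measures how much of its squared norm lies in the top two singular directions (an uncentered SVD is an assumption here). The helper name `top_k_energy_fraction` is hypothetical, and the two arrays are synthetic stand-ins rather than model outputs.

```python
import numpy as np

def top_k_energy_fraction(trajectory, k=2):
    """Fraction of a (steps, dim) trajectory's squared norm in its top-k singular directions."""
    s = np.linalg.svd(trajectory, compute_uv=False)
    energy = s**2
    return float(energy[:k].sum() / energy.sum())

rng = np.random.default_rng(0)
steps, dim = 50, 512

# Synthetic rotation from a "noise" vector x_T to an "image" vector x_0.
x_T = rng.standard_normal(dim)
x_0 = rng.standard_normal(dim)
theta = np.linspace(0.0, 0.5 * np.pi, steps)[:, None]
rotation = np.cos(theta) * x_T + np.sin(theta) * x_0

# Unstructured comparison: independent Gaussian rows with no shared plane.
unstructured = rng.standard_normal((steps, dim))

print(top_k_energy_fraction(rotation))      # essentially 1: the path spans two vectors
print(top_k_energy_fraction(unstructured))  # much smaller
```

For a real model, `rotation` would be replaced by the flattened states $\mathbf{x}_t$ saved at each sampler step; a value near 1 would be consistent with the rotation-like behavior described in the caption.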