Optimizing Noise Schedules of Generative Models in High Dimensions

Santiago Aranguri, Giulio Biroli, Marc Mezard, Eric Vanden-Eijnden

TL;DR

This work analyzes noise schedules for high-dimensional diffusion-based generative models through the lens of stochastic interpolants, revealing a fundamental VP/VE dichotomy: VP tends to recover low-level per-mode structure while VE captures high-level inter-mode asymmetry. Time-dilated interpolation schedules are proposed to jointly recover both types of features, yielding a well-defined limiting probability-flow ODE in dimension $d$ that can be discretized with $\Theta_d(1)$ steps. The authors establish theory for Gaussian Mixtures and Curie-Weiss distributions, connect the interpolants to standard score-based diffusion models, and validate the approach with GM/CW simulations and CelebA experiments, showing improved feature recovery and discretization efficiency. Practically, the results provide dimension-robust noise schedules that enable efficient sampling in high-dimensional diffusion models while preserving both global structure and fine-grained details.

Abstract

Recent works have shown that diffusion models can undergo phase transitions, the resolution of which is needed for accurately generating samples. This has motivated the use of different noise schedules, the two most common choices being referred to as variance preserving (VP) and variance exploding (VE). Here we revisit these schedules within the framework of stochastic interpolants. Using the Gaussian Mixture (GM) and Curie-Weiss (CW) data distributions as test case models, we first investigate the effect of the variance of the initial noise distribution and show that VP recovers the low-level feature (the distribution of each mode) but misses the high-level feature (the asymmetry between modes), whereas VE performs oppositely. We also show that this dichotomy, which happens when denoising by a constant amount in each step, can be avoided by using noise schedules specific to VP and VE that allow for the recovery of both high- and low-level features. Finally we show that these schedules yield generative models for the GM and CW model whose probability flow ODE can be discretized using $\Theta_d(1)$ steps in dimension $d$ instead of the $\Theta_d(\sqrt{d})$ steps required by constant denoising.
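
To make the two kinds of features concrete, the following is a minimal sketch that samples from a two-mode Gaussian mixture and estimates the high-level feature (the mode weight $p$) and the low-level feature (the per-mode variance $\sigma^2$). The specific form of the mixture is an assumption made for illustration: modes at $\pm r$ with $r=(1,\dots,1)$ and isotropic variance $\sigma^2$. The function names `sample_gm` and `estimate_features` are hypothetical and do not come from the paper.

```python
import numpy as np

def sample_gm(n, d, p=0.8, sigma2=0.25, seed=0):
    """Draw n samples from the ASSUMED mixture
    p * N(r, sigma2 * Id) + (1 - p) * N(-r, sigma2 * Id), with r = (1, ..., 1)."""
    rng = np.random.default_rng(seed)
    r = np.ones(d)  # assumed mode direction, not taken from the paper
    signs = np.where(rng.random(n) < p, 1.0, -1.0)
    x = signs[:, None] * r + np.sqrt(sigma2) * rng.standard_normal((n, d))
    return x, r

def estimate_features(x, r):
    """Estimate the mode weight p (high-level) and the per-mode
    variance sigma2 (low-level) from samples x."""
    d = x.shape[1]
    m = x @ r / d                        # overlap with the mode direction
    p_hat = np.mean(m > 0)               # fraction of samples in the +r mode
    resid = x - np.sign(m)[:, None] * r  # remove the mode centre
    sigma2_hat = resid.var()             # spread within each mode
    return p_hat, sigma2_hat

x, r = sample_gm(n=2000, d=200)
p_hat, sigma2_hat = estimate_features(x, r)
print(p_hat, sigma2_hat)
```

In this picture, a sampler that recovers only the low-level feature would produce a correct `sigma2_hat` but a wrong `p_hat`, and a sampler that recovers only the high-level feature would do the opposite.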

Paper Structure (18 sections, 15 theorems, 185 equations, 5 figures)

Key Result

Lemma 1

Let $X_{\tau}$ solve the probability flow ODE. If $X_{0}\sim \mathcal{N}(0,c^2\text{Id}_{d})$, then $X_{\tau}\stackrel{d}{=}I_{\tau}$ for all $\tau\in[0,1]$, and in particular, $X_{\tau=1}\sim\mu$.
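
The statement of the lemma can be checked numerically in a toy case where everything is available in closed form. The sketch below is an illustration under stated assumptions, not the paper's construction: it assumes a one-dimensional linear interpolant $I_\tau=(1-\tau)z+\tau a$ with $z\sim\mathcal{N}(0,c^2)$ and a Gaussian target $a\sim\mathcal{N}(m,s^2)$, for which the velocity $\mathbb{E}[\dot I_\tau \mid I_\tau=x]$ is linear in $x$. The names `velocity` and `probability_flow` are hypothetical.

```python
import numpy as np

def velocity(x, tau, m, s, c):
    """Velocity E[dI/dtau | I_tau = x] for the ASSUMED interpolant
    I_tau = (1 - tau) * z + tau * a, with z ~ N(0, c^2) and a ~ N(m, s^2)."""
    var = (1 - tau) ** 2 * c ** 2 + tau ** 2 * s ** 2   # Var(I_tau)
    half_dvar = tau * s ** 2 - (1 - tau) * c ** 2       # (1/2) d Var(I_tau) / d tau
    return m + half_dvar / var * (x - tau * m)

def probability_flow(x0, m, s, c, n_steps=1000):
    """Integrate dX/dtau = velocity(X, tau) from tau = 0 to tau = 1
    with a uniform forward Euler scheme."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt, m, s, c)
    return x

# Push samples of N(0, c^2) through the flow; the result should be close to N(m, s^2).
rng = np.random.default_rng(0)
x1 = probability_flow(2.0 * rng.standard_normal(10_000), m=1.0, s=0.5, c=2.0)
print(x1.mean(), x1.std())
```

In this Gaussian case the exact flow map is $x_0\mapsto m+(s/c)\,x_0$, so the printed mean and standard deviation should be close to $m$ and $s$, up to Euler and Monte Carlo error.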

Figures (5)

  • Figure 1: (Left panel) We consider the time dilation used by ho2020denoising, where $\tau_t = \exp\left(\gamma_{\min}\ln t -(\gamma_{\max}-\gamma_{\min})(\ln t)^2/2\right)$ for $\gamma_{\max}=20$ and $\gamma_{\min}=0.1$, and compare it with the time dilation \ref{eq:vp:time_dil} used in our analysis, with $d=256^2$ (since ho2020denoising works with $256\times256$ images) and $\kappa=3$. Since the VP SDE is run until $s=1$, the time dilation from ho2020denoising is only used for $t\in[1/e, 1]$. (Right panel) We plot the magnitude $\alpha_t$ of the noise of the dilated VE interpolant, $\alpha_t=\sqrt{d}(1-\tau_t)$ with $\tau_t$ defined in \ref{eq:t_dil:ve}. We also plot the magnitude of the noise for the VE SDE, $\alpha_t=\sqrt{\sigma^2_{1-t} - \sigma^2_0}$, from song2021scorebasedgenerativemodelingstochastic.
  • Figure 2: We run $100$ realizations $(X^{(j)}_t)_{j=1}^{100}$ of the probability flow ODE \ref{eq:ode} associated with the dilated VE interpolant for the GM distribution, uniformly discretized with step size $\Delta t=0.01$, with $d=10^6$, $\kappa=3$, $\sigma^2=1/4$, and $p=0.8$. For each realization, we plot in the top panel $M^{(j)}_t=r\cdot X_t^{(j)}/d$. We then take a single realization $X^{(1)}_t$ and plot in the middle panel the trajectory of the coordinates $(X^{(1)}_t)^i$ for $i=1, \cdots, 500$ in the second phase $t\in[1/2, 1]$. We do not plot $t\in[0,1/2]$ since $X^{(1)}_t$ is of order $\sqrt{d}$ there and would clutter the plot. For $t$ close to $1$, the trajectories converge to a Gaussian centered at $1$, indicated by the dashed line. We also run the probability flow ODE associated with the dilated VE interpolant generating samples from the CW distribution, with $d$, $\Delta t$, $\kappa$ as for the GM distribution and $\beta=2$. We then plot in the bottom panel the trajectories of the first $500$ coordinates of one realization, with dashed lines at $\pm 1$.
  • Figure 3: We plot, for different numbers of discretization steps and for the VP/VE SDEs, the KL divergence between the race distribution of a set of $7,500$ generated images and the race distribution of the original dataset in the top panel (high-level feature). In the bottom panel (low-level feature), we plot the percentage of the generated images that are classified as not containing a face.
  • Figure 4: For different numbers of discretization steps, we include images generated by the VP SDE from song2021scorebasedgenerativemodelingstochastic pretrained on the CelebA-HQ dataset huggingface_ddpm_celebahq_256. We see that for a small number of steps, the samples look alike, and diversity increases with the number of steps.
  • Figure 5: For different numbers of discretization steps, we show images generated by the VE SDE from song2021scorebasedgenerativemodelingstochastic pretrained on the CelebA-HQ dataset huggingface_ncsnpp_celebahq_256. Samples with a small number of steps are lacking in quality, but not in diversity. As we increase the number of steps, the quality improves.
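
The two closed-form expressions quoted in the Figure 1 caption can be evaluated directly. The sketch below is a minimal illustration: `tau_ho2020` implements the time dilation attributed to ho2020denoising with the stated $\gamma_{\min}=0.1$ and $\gamma_{\max}=20$, and `ve_noise_magnitude` implements $\alpha_t=\sqrt{d}(1-\tau_t)$. The function names are hypothetical, and the VE time dilation $\tau_t$ itself is not reproduced in this summary, so it is left as an input rather than guessed.

```python
import numpy as np

def tau_ho2020(t, gamma_min=0.1, gamma_max=20.0):
    """Time dilation from the Figure 1 caption:
    tau_t = exp(gamma_min * ln t - (gamma_max - gamma_min) * (ln t)^2 / 2)."""
    log_t = np.log(t)
    return np.exp(gamma_min * log_t - (gamma_max - gamma_min) * log_t ** 2 / 2)

def ve_noise_magnitude(tau, d):
    """Noise magnitude of the dilated VE interpolant, alpha_t = sqrt(d) * (1 - tau_t).
    The caller supplies tau_t; its definition is not given in this summary."""
    return np.sqrt(d) * (1 - tau)

# Evaluate the dilation on the interval t in [1/e, 1] used in the caption.
ts = np.linspace(np.exp(-1.0), 1.0, 5)
print(tau_ho2020(ts))
```

On $[1/e, 1]$ this dilation increases monotonically from $e^{-10.05}$ to $1$, so most of the change in $\tau_t$ is concentrated near $t=1$.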

Theorems & Definitions (25)

  • Lemma 1
  • Proposition 1: VP only captures $\sigma^2$
  • Proposition 2: VE only captures $p$
  • Theorem 1: Dilated VP captures $p$ and $\sigma^2$
  • Theorem 2: Dilated VE captures $p$ and $\sigma^2$
  • Theorem 3: Dilated VE captures both features for CW
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • ...and 15 more