Optimal Stopping in Latent Diffusion Models

Yu-Han Wu, Quentin Berthet, Gérard Biau, Claire Boyer, Romuald Elie, Pierre Marion

TL;DR

This work analyzes how latent dimensionality in Latent Diffusion Models interacts with the backward-diffusion stopping time to affect sample quality. Using a Gaussian model with a linear autoencoder, it derives how the Fréchet/Wasserstein-2 distance between the data and generated distributions evolves and shows a time-dependent trade-off where low latent dimensions benefit from earlier stopping and higher dimensions require later stopping. It further develops results for score-matching ERMs under norm constraints and extends the findings to general Gaussian covariances, illustrating that PCA-like projections are optimal on certain time intervals. The results offer a principled guideline for choosing latent dimension and stopping time to optimize generation quality while managing computation, supported by experiments on real data such as CelebA.

Abstract

We identify and analyze a surprising phenomenon in Latent Diffusion Models (LDMs): the final steps of the diffusion can degrade sample quality. In contrast to conventional arguments that justify early stopping for numerical stability, this phenomenon is intrinsic to the dimensionality reduction in LDMs. We provide a principled explanation by analyzing the interaction between latent dimension and stopping time. Under a Gaussian framework with linear autoencoders, we characterize the conditions under which early stopping is needed to minimize the distance between generated and target distributions. More precisely, we show that lower-dimensional representations benefit from earlier termination, whereas higher-dimensional latent spaces require later stopping times. We further establish that the latent dimension interacts with other hyperparameters of the problem, such as constraints on the parameters of score matching. Experiments on synthetic and real datasets illustrate these properties, underlining that early stopping can improve generative quality. Together, our results offer a theoretical foundation for understanding how the latent dimension influences sample quality, and highlight the stopping time as a key hyperparameter in LDMs.
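
As a rough illustration of the kind of distance this analysis tracks, the sketch below computes the standard closed-form squared Fréchet (Wasserstein-2) distance between two Gaussians and evaluates it for a toy target whose coordinates are truncated to the first $d$. The function names, the standard deviations, and the use of coordinate truncation as a stand-in for a linear autoencoder are assumptions made for illustration, not taken from the paper. The sketch captures only the error due to dimensionality reduction, not the diffusion dynamics or the stopping time.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a positive semi-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T


def frechet_distance_sq(mean1, cov1, mean2, cov2):
    """Squared Frechet distance between N(mean1, cov1) and N(mean2, cov2)."""
    root1 = psd_sqrt(cov1)
    cross = psd_sqrt(root1 @ cov2 @ root1)
    mean_term = np.sum((mean1 - mean2) ** 2)
    cov_term = np.trace(cov1 + cov2 - 2.0 * cross)
    return float(mean_term + cov_term)


# Hypothetical toy target: D = 4 independent coordinates (values are assumptions).
sigmas = np.array([2.0, 1.0, 0.5, 0.1])
target_cov = np.diag(sigmas ** 2)
zero_mean = np.zeros(sigmas.size)


def projected_cov(d):
    """Covariance of the target after keeping only its first d coordinates."""
    kept = np.zeros(sigmas.size)
    kept[:d] = sigmas[:d] ** 2
    return np.diag(kept)


# Squared distance between the truncated and the full distribution, for d = 1..4.
distances = [
    frechet_distance_sq(zero_mean, projected_cov(d), zero_mean, target_cov)
    for d in range(1, sigmas.size + 1)
]
print(distances)
```

In this toy setting the squared distance reduces to the sum of the discarded variances, so it shrinks as $d$ grows and vanishes at $d = D$.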

Paper Structure

This paper contains 36 sections, 13 theorems, 109 equations, 6 figures, 3 tables.

Key Result

Proposition 1

Let $P_d\overleftarrow{X}_{t}$ and $P_d\overleftarrow{\hat{X}}_{t}$ be given as in \eqref{eq:SDE-backward} and \eqref{eq:SDE-backward-estimated}, respectively. For $d\in\{1,\ldots,D\}$, the Fréchet distance $d_F(P_d^\top P_d\overleftarrow{X}_{t}, \overrightarrow{X}_0)$ is non-increasing with respect to $t$. On the

Figures (6)

  • Figure 1: (left) FID-30k score of a latent diffusion model on CelebA-HQ, with latent shape $64\times64\times3$. (right) FID-30k score of a standard diffusion model (trained in pixel space) on CelebA64 ($64\times64\times3$).
  • Figure 2: Samples generated with a latent diffusion model (LDM) and a pixel-space diffusion model. In the LDM, the second-to-last sample is nearly denoised and indistinguishable from the final one, whereas in the pixel-space model stronger noise remains at that timestep. See Appendix \ref{app:exp} for more examples.
  • Figure 3: $\bar{a}^{-2}$ in the Ornstein-Uhlenbeck process.
  • Figure 4: Plots of $d_F^2(P_d^\top P_d\overleftarrow{\hat{X}}_{t}, \overrightarrow{X}_0)$ as a function of the diffusion time $t$, for two sets of variances $(\sigma_1, \dots, \sigma_D)$. (left) All the $\sigma_i$ are nonzero. As expected from Proposition \ref{prop:min_wasserstein_projected}, the $d$-dimensional projection is optimal in $[t_d, t_{d+1})$. (right) The data is supported on a linear subspace of dimension $d_0=4$ with $D=6$. As expected from Proposition \ref{cor:isotropic_case-optimal-stopping-time}, we observe that the minimum distance is achieved in dimension $d_0$ and with early stopping. LogSNR on the $x$-axis is a remapping of time $t$, defined as $\log(b_t^2/a_t^2)$, which we use to improve readability. Experimental details are in Appendix \ref{app:exp}.
  • Figure 5: The final steps of LDM do not improve image quality.
  • ...and 1 more figure

Theorems & Definitions (15)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Corollary 1
  • Proposition 6
  • Proposition 7
  • Proposition 8
  • Lemma 1
  • ...and 5 more