Lipschitz Singularities in Diffusion Models

Zhantao Yang, Ruili Feng, Han Zhang, Yujun Shen, Kai Zhu, Lianghua Huang, Yifei Zhang, Yu Liu, Deli Zhao, Jingren Zhou, Fan Cheng

TL;DR

This work identifies and theoretically proves that diffusion models can exhibit infinite Lipschitz constants with respect to the time variable near the zero point $t=0$, threatening stability during training and sampling. To address this, the authors propose E-TSDM, which shares timestep conditions within a near-zero interval by partitioning $[0, \tilde{t})$ into $n$ sub-intervals and using a fixed left-end timestep per sub-interval, thereby reducing Lipschitz constants without altering the forward process or network architecture. Empirically, E-TSDM delivers consistent improvements over DDPM baselines across unconditional and conditional generation, as well as faster sampling scenarios, and shows favorable generalization to continuous-time diffusion and various noise schedules. These results offer both a deeper understanding of the diffusion process and a practical method to enhance stability and performance in diffusion-based generation systems.
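The timestep-sharing rule summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes integer timesteps and that $\tilde{t}$ is divisible by $n$. The function name `shared_timestep` and the default `n = 5` are hypothetical choices made for the example; `t_tilde = 100` mirrors the default value quoted in the paper's Figure 1 caption.

```python
def shared_timestep(t: int, t_tilde: int = 100, n: int = 5) -> int:
    """Return the timestep condition passed to the network for timestep t.

    Assumptions (illustrative, not from the paper's code):
    integer timesteps, and t_tilde divisible by n.
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    if n <= 0 or t_tilde % n != 0:
        raise ValueError("t_tilde must be a positive multiple of n")
    if t >= t_tilde:
        # Outside the near-zero interval: condition on t itself, as in DDPM.
        return t
    width = t_tilde // n          # length of each sub-interval
    return (t // width) * width   # left end of the sub-interval containing t


# Timesteps 0..19 share condition 0, 20..39 share condition 20, and so on.
conditions = [shared_timestep(t) for t in (0, 19, 20, 99, 100, 500)]
print(conditions)  # [0, 0, 20, 80, 100, 500]
```

Because the condition is constant inside each sub-interval, the network output cannot vary with $t$ there, which is how the Lipschitz constant with respect to $t$ is forced to zero within a sub-interval. Only the conditioning value changes; the forward process and the network architecture are untouched.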

Abstract

Diffusion models, which employ stochastic differential equations to sample images through integrals, have emerged as a dominant class of generative models. However, the rationality of the diffusion process itself receives limited attention, leaving open the question of whether the problem is well-posed and well-conditioned. In this paper, we explore a perplexing tendency of diffusion models: they often display the infinite Lipschitz property of the network with respect to the time variable near the zero point. We provide theoretical proofs to illustrate the presence of infinite Lipschitz constants and empirical results to confirm it. The Lipschitz singularities pose a threat to the stability and accuracy during both the training and inference processes of diffusion models. Therefore, the mitigation of Lipschitz singularities holds great potential for enhancing the performance of diffusion models. To address this challenge, we propose a novel approach, dubbed E-TSDM, which alleviates the Lipschitz singularities of the diffusion model near the zero point of timesteps. Remarkably, our technique yields a substantial improvement in performance. Moreover, as a byproduct of our method, we achieve a dramatic reduction in the Fréchet Inception Distance of acceleration methods relying on network Lipschitz, including DDIM and DPM-Solver, by over 33%. Extensive experiments on diverse datasets validate our theory and method. Our work may advance the understanding of the general diffusion process, and also provide insights for the design of diffusion models.

Paper Structure (32 sections, 2 theorems, 27 equations, 22 figures, 8 tables, 2 algorithms)

Key Result

Theorem 3.1

Given a noise schedule, since $\sigma_t = \sqrt{1 - \alpha_t^2}$, we have $\frac{d \sigma_t}{dt} = -\frac{\alpha_t}{\sqrt{1-\alpha_t^2}} \frac{d\alpha_t}{dt}$. As $t$ gets close to 0, the noise schedule requires $\alpha_t \rightarrow 1$, leading to $d \sigma_t / dt \rightarrow \infty$ as long as $\frac{d\alpha_t}{dt}|_{t=0}\neq 0$. [...] Note that $\alpha_t \rightarrow 1$ as $t \rightarrow 0$, thus if $\frac{d\alpha_t}{dt}|_{t=0}\neq 0$ [...]
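The derivative used in the theorem statement follows from the chain rule. The worked example below spells out that step and then evaluates it for a simple schedule. The choice $\alpha_t = 1 - t$ is a hypothetical schedule picked only because it keeps the algebra short; it is not taken from the paper.

```latex
% Chain rule applied to \sigma_t = (1 - \alpha_t^2)^{1/2}
\begin{align*}
\frac{d\sigma_t}{dt}
  &= \tfrac{1}{2}\,(1-\alpha_t^2)^{-1/2}\cdot(-2\alpha_t)\,\frac{d\alpha_t}{dt}
   = -\frac{\alpha_t}{\sqrt{1-\alpha_t^2}}\,\frac{d\alpha_t}{dt}.
\end{align*}

% Illustrative (assumed) schedule: \alpha_t = 1 - t, so d\alpha_t/dt = -1
% and 1 - \alpha_t^2 = 2t - t^2 = t(2 - t).
\begin{align*}
\frac{d\sigma_t}{dt}
  &= \frac{1-t}{\sqrt{t\,(2-t)}}
   \;\sim\; \frac{1}{\sqrt{2t}} \;\longrightarrow\; \infty
   \quad \text{as } t \to 0^{+}.
\end{align*}
```

In this example $\frac{d\alpha_t}{dt}|_{t=0} = -1 \neq 0$, so the condition in the theorem holds and the derivative of $\sigma_t$ diverges at the zero point.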

Figures (22)

  • Figure 1: (a) Conceptual comparison between DDPM (Ho et al., 2020) (I) and our proposed early timestep-shared diffusion model (E-TSDM) (II). DDPM trains the network $\epsilon_\theta(\cdot, t)$ with varying timestep conditions $t$ at each denoising step, whereas E-TSDM uniformly divides the near-zero timestep interval $t\in [0, \tilde{t})$ with high Lipschitz constants into $n$ sub-intervals and shares the condition $t$ within each sub-interval. Here, $\tilde{t}$ denotes the length of the interval for sharing conditions. When $t \ge \tilde{t}$, E-TSDM follows the same procedure as DDPM. However, when $t < \tilde{t}$, E-TSDM shares timestep conditions. (b) Quantitative comparison of the Lipschitz constants between DDPM and E-TSDM. The Lipschitz constants tend to be extremely large near the zero point for DDPM. However, our sharing approach allows E-TSDM to force the Lipschitz constants in each sub-interval to be zero, thereby reducing the overall Lipschitz constants in the timestep interval $t\in [0, \tilde{t})$, where $\tilde{t}$ is set to its default value of 100.
  • Figure 2: Quantitative comparison of the errors caused by a perturbation on the input between E-TSDM and DDPM (Ho et al., 2020). Results show that E-TSDM is more stable, as its prediction is less affected, e.g., the perturbation error of DDPM is 42.0% larger than that of E-TSDM when the perturbation scale is 0.2.
  • Figure 3: Quantitative analysis of alternative methods evaluated with FID-10k $\downarrow$. (a) Regularization: Experimental results on FFHQ $256\times256$ and CelebAHQ $256\times256$ show that regularization techniques can slightly improve the FID of the DDPM (Ho et al., 2020) baseline but perform worse than E-TSDM. (b) Modification of noise schedules (Modified-NS): We implement Modified-NS on linear, quadratic, and cosine schedules. Experimental results on the FFHQ $256\times256$ dataset indicate that the performance of Modified-NS is unstable while E-TSDM achieves better synthesis performance. (c) Remap: Quantitative comparison of the remap method between uniformly sampling $t$ and uniformly sampling $\lambda$, during training and inference, on FFHQ $256\times256$. Specifically, $\mathcal{U}_t$ is $\mathcal{U}[0,1]$, and $\mathcal{U}_\lambda$ is $\mathcal{U}[0, K]$ for $1/t$ but $\mathcal{U}[-K, K]$ for Inverse-Sigmoid, where $K$ is a large number to avoid infinity. (T) denotes the sampling strategy used during training, while (I) denotes the one used during inference. Results show that remap is not helpful.
  • Figure 4: Quantitative comparison on various datasets with $256\times256$ resolution. All experiments are evaluated with FID-10k $\downarrow$.
  • Figure 5: Ablation study on $\tilde{t}$, the length of the interval $t\in [0, \tilde{t})$ over which timestep conditions are shared, and on $n$, the number of sub-intervals within that interval, using FID-10k $\downarrow$ as the evaluation metric. We repeat each experiment three times and provide the error bars.
  • ...and 17 more figures

Theorems & Definitions (2)

  • Theorem 3.1
  • Theorem 4.1