Towards Faster Training of Diffusion Models: An Inspiration of A Consistency Phenomenon

Tianshuo Xu, Peng Mi, Ruilin Wang, Yingcong Chen

TL;DR

This work tackles the high computational cost of training diffusion models by identifying a consistency phenomenon: despite different initializations or architectures, diffusion models produce remarkably similar outputs when conditioned on the same noise, especially as $t$ approaches $T$ where $x_t$ tends to $\epsilon$. The authors attribute this to easier learning in high-noise regimes and the overall smoothness of the DM loss landscape, and they design two acceleration strategies: a curriculum-learning based timestep schedule (CLTS) and a momentum decay with learning rate compensation (MDLRC). Through extensive experiments on CIFAR10 and ImageNet128, these methods yield substantial training speedups (e.g., $2\times$ to $2.6\times$) while maintaining or improving sample quality (lower FID) compared with state-of-the-art approaches. The work provides both theoretical insight into the stability of DMs and practical techniques for faster diffusion-based generation in real-world settings.

Abstract

Diffusion models (DMs) are a powerful generative framework that has attracted significant attention in recent years. However, the high computational cost of training DMs limits their practical applications. In this paper, we start with a consistency phenomenon of DMs: we observe that DMs with different initializations or even different architectures can produce very similar outputs given the same noise inputs, which is rare in other generative models. We attribute this phenomenon to two factors: (1) noise-prediction DMs are easier to learn as the timestep approaches its upper bound (where the input becomes pure noise), which is also where the structural information of the output is usually generated; and (2) the loss landscape of DMs is highly smooth, which implies that the model tends to converge to similar local minima and exhibit similar behavior patterns. This finding not only reveals the stability of DMs, but also inspires us to devise two strategies to accelerate the training of DMs. First, we propose a curriculum-learning-based timestep schedule, which leverages the noise rate as an explicit indicator of the learning difficulty and gradually reduces the training frequency of easier timesteps, thus improving the training efficiency. Second, we propose a momentum decay strategy, which reduces the momentum coefficient during the optimization process, since a large momentum coefficient may slow convergence and cause oscillations given the smoothness of the loss landscape. We demonstrate the effectiveness of our proposed strategies on various models and show that they can significantly reduce the training time and improve the quality of the generated images.
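To make the two strategies more concrete, the following is a minimal Python sketch under stated assumptions. It is not the authors' implementation: the abstract does not give the exact schedules, so the linear noise-rate proxy, the linear momentum decay, and the heavy-ball learning rate compensation rule are all hypothetical choices, as are the function names and default values.

```python
import numpy as np

def timestep_probs(num_timesteps, progress, max_downweight=0.9):
    """Sampling probabilities over timesteps 0..T-1.

    progress is the fraction of training completed, in [0, 1].
    Assumption: the noise rate grows linearly with t, and high-noise
    (easy) timesteps are sampled less often as training progresses.
    """
    t = np.arange(num_timesteps)
    noise_rate = (t + 1) / num_timesteps  # hypothetical proxy, equals 1 at t = T-1
    weights = 1.0 - max_downweight * progress * noise_rate
    return weights / weights.sum()

def sample_timesteps(rng, num_timesteps, batch_size, progress):
    """Draw a batch of timesteps from the curriculum distribution."""
    probs = timestep_probs(num_timesteps, progress)
    return rng.choice(num_timesteps, size=batch_size, p=probs)

def decayed_momentum(step, total_steps, beta_start=0.9, beta_end=0.5):
    """Linearly decay the momentum coefficient (assumed schedule)."""
    frac = min(step / total_steps, 1.0)
    return beta_start + (beta_end - beta_start) * frac

def compensated_lr(base_lr, beta, beta_start=0.9):
    """Assumed compensation rule for heavy-ball SGD.

    With v = beta * v + g and w -= lr * v, the effective step size is
    roughly lr / (1 - beta), so lr is rescaled to keep that ratio fixed.
    """
    return base_lr * (1.0 - beta) / (1.0 - beta_start)

rng = np.random.default_rng(0)
batch_t = sample_timesteps(rng, num_timesteps=1000, batch_size=8, progress=0.5)
beta = decayed_momentum(step=5_000, total_steps=10_000)
lr = compensated_lr(base_lr=1e-4, beta=beta)
```

At `progress=0.0` the sampler is uniform over timesteps, which matches standard DM training; the curriculum only takes effect as `progress` grows. For optimizers that use an exponential moving average of gradients (as Adam does), the compensation rule above would not apply as written.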

Paper Structure (20 sections, 16 equations, 14 figures, 3 tables)

This paper contains 20 sections, 16 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Illustration of the consistency phenomenon in diffusion models (DMs). Despite different initializations or structural variations, DMs trained on the same dataset produce remarkably consistent results when exposed to identical noise during sampling. (a) presents three models [nichol2021improved] trained on CIFAR10 with different initializations. (b) depicts two models [dhariwal2021beatgans] trained on ImageNet128 with different structures. (c) showcases the large and huge models of UViT [bao2023uvit] trained on ImageNet512.
  • Figure 2: Visualization of the loss landscapes of Improved Diffusion [nichol2021improved] and DCGAN [radford2015dcgan], where $t$ is the timestep of DMs. Both models were trained on the CIFAR10 dataset [krizhevsky2009cifar]. The loss landscape of DMs is visibly smoother than that of GANs. More landscapes of DMs and GANs are shown in the appendix.
  • Figure 3: Illustration of the Hessian spectrum of DMs (left) and GANs (right). $\lambda_i$ is the $i$-th largest eigenvalue, and $\mu$ and $\sigma$ are the mean and variance of the eigenvalues, respectively. The larger the dominant eigenvalue, the sharper the landscape, and the greater the differences among eigenvalues, the more difficult the model is to optimize.
  • Figure 4: Illustration of the 1D-interpolation results of DMs and GANs. The jittery red line indicates that the geometry of the GAN's landscape is rougher.
  • Figure 5: Illustration of the application of our optimization approach on different DMs. (a) Improved Diffusion [nichol2021improved] trained on CIFAR10 [krizhevsky2009cifar], (b) Guided Diffusion [dhariwal2021beatgans] trained on ImageNet128 [deng2009imagenet]. With our methods, these DMs achieve $2\times$ and $2.6\times$ speedups in training, respectively.
  • ...and 9 more figures
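
The 1D interpolation mentioned in the Figure 4 caption can be sketched as follows. This is a toy illustration only: the two loss functions are hypothetical stand-ins (not a DM or GAN objective), and the `roughness` measure is an assumed summary statistic that does not come from the paper.

```python
import numpy as np

def interpolate_loss(loss_fn, theta_a, theta_b, num_points=51):
    """Evaluate loss_fn along the straight line from theta_a to theta_b."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array(
        [loss_fn((1.0 - a) * theta_a + a * theta_b) for a in alphas]
    )
    return alphas, losses

def roughness(losses):
    """Mean absolute second difference of the curve (assumed measure).

    It is zero for a straight line and grows as the curve jitters.
    """
    return float(np.mean(np.abs(np.diff(losses, n=2))))

def smooth_loss(theta):
    # Hypothetical smooth objective: a simple quadratic bowl.
    return float(np.sum(theta ** 2))

def rough_loss(theta):
    # Hypothetical rough objective: the same bowl plus fast oscillations.
    return float(np.sum(theta ** 2) + 0.5 * np.sum(np.sin(25.0 * theta)))

theta_a = -np.ones(4)
theta_b = np.ones(4)
_, smooth_curve = interpolate_loss(smooth_loss, theta_a, theta_b)
_, rough_curve = interpolate_loss(rough_loss, theta_a, theta_b)
print(roughness(smooth_curve), roughness(rough_curve))
```

In practice `theta_a` and `theta_b` would be flattened parameter vectors of two trained checkpoints, and `loss_fn` would evaluate the training loss on a fixed batch; a jittery curve, like the red GAN line in the figure, yields a larger roughness value than a smooth one.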