Adaptive Non-uniform Timestep Sampling for Accelerating Diffusion Model Training

Myunsoo Kim, Donghyeon Ki, Seong-Woong Shim, Byung-Jun Lee

TL;DR

The paper tackles the high computational cost of training diffusion models by identifying non-uniform gradient variance across timesteps as a key bottleneck. It introduces an online, learning-based adaptive timestep sampler $\pi_\phi$ that prioritizes timesteps whose gradient updates most reduce the variational lower bound $\mathcal{L}_{VLB}$, using a surrogate $\tilde{\Delta}_k^t$ of that reduction computed from a small subset of timesteps. In experiments on CIFAR-10, CelebA-HQ, and ImageNet with diverse noise schedules and backbones, the method converges faster and reaches better final fidelity (lower FID) than heuristic acceleration strategies, remains robust to changes in schedule and architecture, and can be combined with existing heuristics. The approach offers a practical path to faster, more robust diffusion-model training, with an extension to score-based diffusion models left to future work.

Abstract

As a highly expressive generative model, diffusion models have demonstrated exceptional success across various domains, including image generation, natural language processing, and combinatorial optimization. However, as data distributions grow more complex, training these models to convergence becomes increasingly computationally intensive. While diffusion models are typically trained using uniform timestep sampling, our research shows that the variance in stochastic gradients varies significantly across timesteps, with high-variance timesteps becoming bottlenecks that hinder faster convergence. To address this issue, we introduce a non-uniform timestep sampling method that prioritizes these more critical timesteps. Our method tracks the impact of gradient updates on the objective for each timestep, adaptively selecting those most likely to minimize the objective effectively. Experimental results demonstrate that this approach not only accelerates the training process, but also leads to improved performance at convergence. Furthermore, our method shows robust performance across various datasets, scheduling strategies, and diffusion architectures, outperforming previously proposed timestep sampling and weighting heuristics that lack this degree of robustness.
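To make the contrast between uniform and non-uniform timestep sampling concrete, here is a minimal NumPy sketch. It is only an illustration under stated assumptions: the softmax-over-logits parameterization, the names `sample_timesteps` and `skewed_logits`, and the choice of logits that favour mid-range timesteps are all hypothetical and are not taken from the paper, where the sampler is learned online rather than fixed.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def sample_timesteps(logits, batch_size, rng):
    # Draw a batch of timestep indices from the categorical
    # distribution defined by the logits.
    probs = softmax(logits)
    return rng.choice(len(probs), size=batch_size, p=probs)

T = 1000                      # number of diffusion timesteps (assumed)
rng = np.random.default_rng(0)

# Uniform sampling: all logits equal, so every timestep is equally likely.
uniform_logits = np.zeros(T)

# Hypothetical non-uniform logits that concentrate mass around t = 500.
# A learned sampler would produce its own logits instead.
skewed_logits = -((np.arange(T) - 500) / 100.0) ** 2

uniform_batch = sample_timesteps(uniform_logits, 4096, rng)
skewed_batch = sample_timesteps(skewed_logits, 4096, rng)
```

In a training loop, the sampled indices would replace the usual uniformly drawn timesteps when forming each minibatch; how the logits are updated is the part the adaptive method addresses.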

Paper Structure

This paper contains 33 sections, 1 theorem, 15 equations, 8 figures, 9 tables, 2 algorithms.

Key Result

Theorem 1

[garrigos2023handbook, informal] Given a few reasonable assumptions, for every $\varepsilon > 0$, we can guarantee that $\mathbb{E}\left[ \mathcal{L}_t(\theta_{K}) - \inf \mathcal{L}_t \right] \leq \varepsilon$ provided that … Here, $\gamma$ is the learning rate of Stochastic Gradient Descent (SGD), $\theta_{K}$ is the parameter after $K$ steps of gradient updates, and $\theta_{*}$ is the optimal p…

Figures (8)

  • Figure 1: Comparison of FID scores of our approach and other acceleration methods against relative wall clock time, where 1D represents the time it takes for the baseline to converge. Although our learning method is initially slower than the heuristics due to its learning-based nature, it converges to a point with better optimality within 1D, and achieves a significantly lower FID score by 1.2D.
  • Figure 2: Comparison of gradient variance (top) and diffusion loss (bottom) over timesteps for three different noise schedules: linear, cosine, and quadratic. The color bar on the right indicates the progression of epochs, with red representing the early epochs and blue representing the later epochs. The results are based on training DDPM [ho2020denoising] with $\mathcal{L}_{\text{DDPM}}$ for 20 epochs.
  • Figure 3: Adaptive Non-Uniform Timestep Sampling. Visualization of the overall architecture of our algorithm. Following an update from $\theta_k$ to $\theta_{k+1}$, we compute the set $\{\delta^t_{k,\tau}\}_{\tau=1}^T$ based on $\theta_k$ and $\theta_{k+1}$, and add it to a queue. A feature selection method is then applied to identify the $|S|$ timesteps that best explain ${\Delta}^t_k$ from this queue. A timestep sampler $\pi_\phi$ is then trained to maximize $\tilde{\Delta}^t_k$, so that it samples timesteps $t$ that are expected to achieve the largest reduction in the diffusion loss. This process continues iteratively to minimize the diffusion loss at the chosen timesteps.
  • Figure 4: Difference in diffusion loss (after versus before the model update) observed when training a diffusion model by sampling only timesteps within the range [0, 200). A significant increase in loss is observed for the unsampled timesteps.
  • Figure 5: Visualization of timestep sampling schemes of our method, Min-SNR, P2, Log-Normal, and SpeeD. For weighting methods (Min-SNR, P2), weights at each timestep are converted to probabilities. The color bar on the right represents the progression of epochs, providing a view of how each scheduling method samples timesteps across the training process.
  • ...and 3 more figures
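
The bookkeeping described in the Figure 3 caption can be sketched on a toy problem. The sketch below is an illustration under several assumptions, not the authors' implementation: each per-timestep loss is a hypothetical quadratic $\frac{1}{2}\|\theta - c_\tau\|^2$, $\delta^t_{k,\tau}$ is taken as the loss at timestep $\tau$ before the update minus the loss after it (so positive values mean improvement), $\Delta^t_k$ is taken as the sum over all $\tau$, and a plain sum over a fixed, hand-picked subset stands in for the feature selection step, which is not specified here. Training of $\pi_\phi$ is omitted.

```python
from collections import deque
import numpy as np

T, D = 8, 4                          # toy number of timesteps and parameter size
rng = np.random.default_rng(0)
targets = rng.normal(size=(T, D))    # hypothetical per-timestep optima c_tau

def losses(theta):
    # Toy per-timestep losses: L_tau(theta) = 0.5 * ||theta - c_tau||^2.
    return 0.5 * ((theta - targets) ** 2).sum(axis=1)

def sgd_step(theta, t, lr=0.1):
    # One gradient step on the loss of the sampled timestep t only.
    return theta - lr * (theta - targets[t])

def delta_vector(theta_k, theta_next):
    # Assumed convention: loss before minus loss after, for every tau.
    return losses(theta_k) - losses(theta_next)

queue = deque(maxlen=64)             # recent (t, delta vector) pairs
theta = np.zeros(D)
for k in range(32):
    t = int(rng.integers(T))         # uniform sampling stands in for pi_phi
    theta_next = sgd_step(theta, t)
    queue.append((t, delta_vector(theta, theta_next)))
    theta = theta_next

t_last, deltas_last = queue[-1]
full_delta = deltas_last.sum()       # total change across all timesteps
subset = [0, 3, 5]                   # hypothetical selected subset S
surrogate_delta = deltas_last[subset].sum()  # cheap stand-in for the total
```

Comparing `surrogate_delta` with `full_delta` over the queue gives a rough sense of how well a small subset tracks the total change, which is the role the selected timesteps play in the caption.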

Theorems & Definitions (1)

  • Theorem 1