A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training

Kai Wang; Mingjia Shi; Yukun Zhou; Zekai Li; Zhihang Yuan; Yuzhang Shang; Xiaojiang Peng; Hanwang Zhang; Yang You

TL;DR

SpeeD tackles the high cost of diffusion-model training by dividing the time steps into acceleration, deceleration, and convergence areas. It introduces asymmetric sampling, which samples convergence-area steps less often, and change-aware weighting, which emphasizes steps whose process increments change rapidly. Together these yield a consistent ~3× training speed-up across architectures and datasets with negligible overhead. The approach is theoretically grounded: it provides boundary definitions for the three areas and generalizes to s-sigma scheduled SDEs. It also proves robust across tasks and datasets and remains compatible with other acceleration methods. In practice, the work lowers the barrier to diffusion-model research by reducing training cost while maintaining or improving sample quality, including on conditional generation tasks.

Abstract

Training diffusion models is always a computation-intensive task. In this paper, we introduce a novel speed-up method for diffusion model training, called SpeeD, which is based on a closer look at time steps. Our key findings are: i) Time steps can be empirically divided into acceleration, deceleration, and convergence areas based on the process increment. ii) These time steps are imbalanced, with many concentrated in the convergence area. iii) The concentrated steps provide limited benefits for diffusion training. To address this, we design an asymmetric sampling strategy that reduces the frequency of steps from the convergence area while increasing the sampling probability for steps from other areas. Additionally, we propose a weighting strategy to emphasize the importance of time steps with rapid-change process increments. As a plug-and-play and architecture-agnostic approach, SpeeD consistently achieves 3-times acceleration across various diffusion architectures, datasets, and tasks. Notably, due to its simple design, our approach significantly reduces the cost of diffusion model training with minimal overhead. Our research enables more researchers to train diffusion models at a lower cost.
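The asymmetric sampling idea described in the abstract can be illustrated with a minimal sketch. It assumes a two-level step distribution in which every time step outside the convergence area is `k` times as likely to be drawn as a step inside it. The function names, the boundary index `tau`, the ratio `k`, and the choice to treat steps at or beyond `tau` (the high-noise end) as the convergence area are illustrative assumptions, not the authors' implementation.

```python
import random


def asymmetric_step_probs(num_steps, tau, k):
    """Return a sampling probability for each time step 0..num_steps-1.

    Assumption (hypothetical): steps with index < tau lie outside the
    convergence area and get relative weight k; steps with index >= tau
    lie inside the convergence area and get relative weight 1.
    """
    if not 0 < tau < num_steps:
        raise ValueError("tau must satisfy 0 < tau < num_steps")
    if k < 1:
        raise ValueError("k must be >= 1 so convergence steps are not favored")
    weights = [float(k) if t < tau else 1.0 for t in range(num_steps)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_time_steps(batch_size, num_steps, tau, k, rng):
    """Draw a batch of time-step indices from the asymmetric distribution."""
    probs = asymmetric_step_probs(num_steps, tau, k)
    return rng.choices(range(num_steps), weights=probs, k=batch_size)


if __name__ == "__main__":
    rng = random.Random(0)
    batch = sample_time_steps(batch_size=8, num_steps=1000, tau=700, k=2.0, rng=rng)
    print(batch)
```

With `k = 1` the sketch reduces to uniform sampling over time steps, which makes it easy to compare against a standard training loop.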

Paper Structure (52 sections, 6 theorems, 17 equations, 12 figures, 13 tables)

Introduction
Speeding Up Training: Time Steps
Preliminaries of Diffusion Models
Overview of SpeeD
Asymmetric Sampling
Threshold Selection $\tau$.
Change-Aware Weighting
Case Study: DDPM
Analyses.
Takeaways.
General Cases: Beyond DDPM
Generalize Theorem 1
Experiments
Visualization
Implementation Details
...and 37 more sections

Key Result

Theorem 1

In the DDPM setting (Ho et al., 2020), the linear-schedule hyper-parameters $\{\beta_{t}\}_{t\in[T]}$ form an arithmetic (equally spaced) sequence with extreme deviation $\Delta_{\beta} := \max_{t} \beta_{t}-\min_{t}\beta_{t}$, and $T$ is the total number of time steps. We then have bounds on the process increment $\delta_{t}$ [...], where $\hat{\phi}_{t}:=\beta_{\max}\exp\{-(\beta_{0}+{\Delta_{\beta}t}/{2T})t\}$ and $\hat{\Psi}_{t}$ [...]
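To get a feel for how quickly the factor $\hat{\phi}_{t}$ in Theorem 1 decays, the following minimal sketch evaluates its closed form numerically. It assumes that $\beta_{0}$ equals the smallest value of the linear schedule and uses the commonly cited DDPM settings ($\beta$ from $10^{-4}$ to $0.02$, $T = 1000$) as example inputs. The function name `phi_hat` and its parameters are illustrative, and only the expression quoted above is computed, not the full (truncated) bound.

```python
import math


def phi_hat(t, beta_0, beta_max, num_steps):
    """Evaluate beta_max * exp(-(beta_0 + delta_beta * t / (2 * T)) * t).

    Assumption (hypothetical): beta_0 is the minimum of the linear schedule,
    so delta_beta = beta_max - beta_0.
    """
    delta_beta = beta_max - beta_0
    rate = beta_0 + delta_beta * t / (2 * num_steps)
    return beta_max * math.exp(-rate * t)


if __name__ == "__main__":
    # Example inputs: linear schedule from 1e-4 to 0.02 over 1000 steps.
    for t in (0, 100, 250, 500, 750, 1000):
        print(t, phi_hat(t, beta_0=1e-4, beta_max=0.02, num_steps=1000))
```

Under these example settings the factor starts at $\beta_{\max}$ for $t = 0$ and shrinks monotonically, falling by several orders of magnitude by the final step.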

Figures (12)

Figure 1: Closer look at time steps: more than half of the time steps are almost pure noise and easy to learn. Motivation: designing efficient training by analyzing the process increment $\delta_{t}$ at different time steps. $\mathbf{E}(\delta_{t})$ and $\text{Var}(\delta_{t})$ are the mean and variance of the process increments $\delta_{t}$. The two histograms show the proportions of the process increments at different noise levels (left) and the proportions of the time steps (right) in the three areas. The loss curve is obtained from DDPM (Ho et al., 2020) on CIFAR-10 (Krizhevsky, 2009).
Figure 2: Re-weighting and re-sampling methods cannot eliminate the redundancy and under-sampling issues. $w(t)$ and $\mathbf{P}(t)$ are the weighting and sampling curves, respectively. The probability of the convergence area being sampled is unchanged, while that of the acceleration area is reduced faster.
Figure 3: Core designs of SpeeD. Red and blue lines denote sampling and weighting curves.
Figure 4: Visualization of Theorem 1: the three areas of acceleration, deceleration, and convergence.
Figure 5: SpeeD obtains significant improvements over the baseline in visualizations. More visualizations on other datasets and tasks can be found in the Appendix.
...and 7 more figures

Theorems & Definitions (9)

Theorem 1: Process increment in DDPM
Remark 1
Lemma 1: Bounded $\alpha$ by $\beta$
Proposition A.1: Jensen's inequality
Proposition A.2: triangle inequality
Proposition A.3: matrix norm compatibility
Proposition A.4: Peter-Paul inequality
proof
proof
