Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures

Huijie Zhang, Yifu Lu, Ismail Alkhouri, Saiprasad Ravishankar, Dogyoon Song, Qing Qu

TL;DR

Diffusion models face slow training and sampling due to long diffusion trajectories and large time-conditioned networks. The authors propose a multi-stage diffusion framework that couples a universal encoder with stage-specific decoders and introduces an optimal denoiser–based timestep clustering to partition timesteps, enabling efficient resource allocation and reduced inter-stage interference. Across three state-of-the-art diffusion models, including large-scale latent diffusion models, this approach yields substantial gains in training and sampling efficiency while maintaining or improving sample quality, as evidenced by improved FID and reduced computational budgets. The work also provides thorough ablations showing the impact of timestep clustering and the multi-decoder design on overall performance, suggesting broad applicability to both unconditional and more complex diffusion setups.

Abstract

Diffusion models, emerging as powerful deep generative tools, excel in various applications. They operate through a two-step process: introducing noise into training samples and then employing a model to convert random noise into new samples (e.g., images). However, their remarkable generative performance is hindered by slow training and sampling. This is due to the necessity of tracking extensive forward and reverse diffusion trajectories, and employing a large model with numerous parameters across multiple timesteps (i.e., noise levels). To tackle these challenges, we present a multi-stage framework inspired by our empirical findings. These observations indicate the advantages of employing distinct parameters tailored to each timestep while retaining universal parameters shared across all timesteps. Our approach involves segmenting the time interval into multiple stages, where we employ a custom multi-decoder U-Net architecture that blends time-dependent models with a universally shared encoder. Our framework enables the efficient distribution of computational resources and mitigates inter-stage interference, which substantially improves training efficiency. Extensive numerical experiments affirm the effectiveness of our framework, showcasing significant training and sampling efficiency enhancements on three state-of-the-art diffusion models, including large-scale latent diffusion models. Furthermore, our ablation studies illustrate the impact of two important components in our framework: (i) a novel timestep clustering algorithm for stage division, and (ii) an innovative multi-decoder U-Net architecture, seamlessly integrating universal and customized hyperparameters.
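
The routing idea behind a shared encoder with stage-specific decoders can be illustrated with a small sketch. This is not the authors' implementation: the class name `MultistageDenoiser`, the cut points `0.3` and `0.7`, and the toy encoder and decoders are all hypothetical, and plain Python callables stand in for U-Net blocks. The sketch only assumes that time is normalized to $[0, 1]$ and split into contiguous stages.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector, float], Vector]


@dataclass
class MultistageDenoiser:
    """Shared encoder plus one decoder per timestep stage (illustrative only)."""

    boundaries: Sequence[float]  # interior cut points, e.g. (t1, t2) for 3 stages
    encoder: Block               # parameters shared across all timesteps
    decoders: Sequence[Block]    # one set of parameters per stage

    def __post_init__(self) -> None:
        if list(self.boundaries) != sorted(self.boundaries):
            raise ValueError("boundaries must be sorted in increasing order")
        if len(self.decoders) != len(self.boundaries) + 1:
            raise ValueError("need exactly one decoder per stage")

    def stage_of(self, t: float) -> int:
        """Map t in [0, 1] to a stage index: [0, t1), [t1, t2), ..., [t_k, 1]."""
        if not 0.0 <= t <= 1.0:
            raise ValueError("t must lie in [0, 1]")
        return bisect_right(self.boundaries, t)

    def __call__(self, x: Vector, t: float) -> Vector:
        features = self.encoder(x, t)           # always the same encoder
        decoder = self.decoders[self.stage_of(t)]  # stage-specific decoder
        return decoder(features, t)


# Toy stand-ins for network blocks (hypothetical, for demonstration only).
toy_encoder: Block = lambda x, t: [2.0 * v for v in x]
toy_decoders = [
    (lambda h, t, shift=shift: [v + shift for v in h])
    for shift in (0.0, 10.0, 20.0)
]

model = MultistageDenoiser((0.3, 0.7), toy_encoder, toy_decoders)
print(model.stage_of(0.1), model.stage_of(0.5), model.stage_of(0.9))
```

In this sketch the cut points are fixed by hand; in the paper they come from the timestep clustering algorithm, which is not reproduced here.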

Paper Structure

This paper contains 34 sections, 1 theorem, 19 equations, 3 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

Suppose we train a diffusion model denoiser function $\bm \epsilon_{\bm \theta} (\bm x,t)$ with parameters $\bm \theta$ using dataset $\left\lbrace \bm y_i \in \mathbb{R}^n \right\rbrace_{i=1}^N$ by where $\bm x_0 \sim p_{\text{data}}(\bm x) = \frac{1}{N} \sum_{i =1}^{N} \delta (\bm x - \bm y_i)$, $\bm \epsilon \sim \mathcal{N}(0,\textbf{I})$, and $\bm x_t \sim p_{t}(\bm x_t|\bm x_0) = \mathcal{
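
For intuition about this setting: when the data distribution is a finite sum of point masses and the perturbation is Gaussian, the noise predictor that minimizes the mean squared error is the conditional expectation of $\bm \epsilon$ given $\bm x_t$, which has a closed form as a softmax-weighted average over the training points. The sketch below is a minimal illustration under a stated assumption, namely that the perturbation kernel is $\mathcal{N}(s_t \bm x_0, s_t^2 \sigma_t^2 \textbf{I})$, i.e. $\bm x_t = s_t \bm x_0 + s_t \sigma_t \bm \epsilon$. The symbols `s` and `sigma` and the function name `optimal_eps` are choices made for this example and are not taken from the statement above.

```python
import math
from typing import List, Sequence


def optimal_eps(
    x: Sequence[float],
    data: Sequence[Sequence[float]],
    s: float,
    sigma: float,
) -> List[float]:
    """MSE-optimal noise prediction E[eps | x_t = x] for an empirical dataset.

    Assumes x_t = s * x_0 + s * sigma * eps with eps ~ N(0, I) and x_0 drawn
    uniformly from `data` (an assumption made for this sketch).
    """
    if not data or s <= 0.0 or sigma <= 0.0:
        raise ValueError("need a non-empty dataset and positive s, sigma")

    var = (s * sigma) ** 2
    # Log posterior weight of each training point, up to a shared constant.
    logits = [
        -sum((xj - s * yj) ** 2 for xj, yj in zip(x, y)) / (2.0 * var)
        for y in data
    ]
    # Subtract the max before exponentiating for numerical stability.
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)

    # Posterior mean of x_0 given x_t.
    x0_hat = [
        sum(w * y[j] for w, y in zip(weights, data)) / total
        for j in range(len(x))
    ]
    # Convert the posterior mean of x_0 into the posterior mean of eps.
    return [(xj - s * x0j) / (s * sigma) for xj, x0j in zip(x, x0_hat)]


# With a single training point, the injected noise is recovered exactly.
y = [1.0, -2.0]
eps = [0.4, -0.6]
x_t = [1.0 * yj + 1.0 * 0.5 * ej for yj, ej in zip(y, eps)]
print(optimal_eps(x_t, [y], s=1.0, sigma=0.5))
```

At small noise levels the weights concentrate on the nearest training point, while at large noise levels they approach a uniform average, which is one way to see why the ideal denoiser behaves differently across timesteps.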

Figures (3)

  • Figure 1: Overview of three diffusion model architectures: (a) unified, (b) separate, and (c) our proposed multistage architecture. Compared with (a) and (b), our approach improves sampling quality and significantly enhances training efficiency, as indicated by the FID scores and their corresponding training iterations (d).
  • Figure 2: Comparison between the separate and unified architectures w.r.t. image generation quality in different intervals: (a) analysis on interval $[0, t_1)$; and (b) analysis on interval $[t_2, 1]$. As illustrated at the top of each panel, in both (a) and (b) the separate architectures are trained and used for sampling only within the specific interval. For the remainder of the sampling process, we use a well-trained diffusion model $(\bm{\epsilon}_{\bm\theta})^{[0,1]}_{4 \times 10^5}$ to approximate the ground-truth score function. For example, in the top figure of (a), sampling with the separate architecture uses the trained model $(\bm{\epsilon}_{\bm\theta'})^{[0,t_1)}_{i}$ for interval 0 and the well-trained model $(\bm{\epsilon}_{\bm\theta})^{[0,1]}_{4 \times 10^5}$ for intervals 1 and 2. Notably, both $(\bm{\epsilon}_{\bm\theta})^{[0, 1]}_{i}$ and $(\bm{\epsilon}_{\bm\theta})^{[0, 1]}_{4\times10^5}$ use the model with 108M parameters. For the separate architecture, the number in parentheses is the number of parameters of the model $(\bm{\epsilon}_{\bm\theta'})^{[a, b]}_{i}$. For example, for the separate architecture (169M) in (a), the model $(\bm{\epsilon}_{\bm\theta'})^{[0, t_1)}_{i}$ has 169M parameters $\bm\theta'$. The bottom figures in (a) and (b) show the FID of the generations from each architecture at different training iterations.
  • Figure 3: Sample generations from Multistage LDM (CelebA $256\times256$) and Multistage DPM-Solver (CIFAR-10 $32\times32$).

Theorems & Definitions (1)

  • Proposition 1