Adaptive Training Meets Progressive Scaling: Elevating Efficiency in Diffusion Models

Wenhao Li; Xiu Su; Yu Han; Shan You; Tao Huang; Chang Xu

TL;DR

This paper tackles the inefficiency of diffusion models relying on a single denoiser across all timesteps, where distributions and task difficulty vary substantially. It introduces TDC Training, a two-stage divide-and-conquer framework that groups timesteps by difficulty using the SNR-based measure $SNR = 10 \log_{10}\left(\frac{\overline{\alpha}_t}{1-\overline{\alpha}_t}\right)$ and allocates progressive FLOPs via $FLOPs_{g}(i)=\left(\frac{i}{\mathcal{N}}+\frac{\mathcal{N}-i}{\mathcal{N}}\times k\right)\mathcal{F}$, followed by deriving group-specific denoisers through Proxy-based Pruning with GPT-4 and a memory bank for iterative refinement. The approach yields substantial FID improvements (e.g., $0.32$ on CIFAR10, $1.5$ on ImageNet64, $0.27$ on FFHQ) while reducing compute by about 20% across IDDPM and LDM. A two-stage training strategy outperforms single-stage counterparts and proves robust to FLOPs budgeting ($k$), with pruning stability aided by the memory mechanism. Overall, the method provides a practical, scalable path to task-aware diffusion with meaningful efficiency gains.
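The two formulas above can be evaluated directly. The following is a minimal sketch, not the authors' implementation. The linear beta schedule, the helper names (`alpha_bar_schedule`, `snr_db`, `group_flops`), and the convention that the group index $i$ runs from 1 to $\mathcal{N}$ are all assumptions made for illustration; the summary does not specify them.

```python
import math


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule.

    The linear schedule and its endpoints are assumptions for this sketch.
    """
    values, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        values.append(running)
    return values


def snr_db(alpha_bar_t):
    """SNR = 10 * log10(alpha_bar_t / (1 - alpha_bar_t)), in decibels."""
    return 10.0 * math.log10(alpha_bar_t / (1.0 - alpha_bar_t))


def group_flops(i, num_groups, k, full_flops):
    """FLOPs_g(i) = (i / N + (N - i) / N * k) * F.

    Assumes the group index i runs from 1 to N (hypothetical convention).
    """
    return (i / num_groups + (num_groups - i) / num_groups * k) * full_flops


if __name__ == "__main__":
    snrs = [snr_db(a) for a in alpha_bar_schedule()]
    print(f"SNR at first step: {snrs[0]:.2f} dB, at last step: {snrs[-1]:.2f} dB")
    for i in range(1, 5):
        print(i, group_flops(i, num_groups=4, k=0.5, full_flops=100.0))
```

Under these assumptions, the budget interpolates linearly between $k\mathcal{F}$ and $\mathcal{F}$ as $i$ grows, so with $k < 1$ only the group at $i = \mathcal{N}$ receives the full budget $\mathcal{F}$.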

Abstract

Diffusion models have demonstrated remarkable efficacy in various generative tasks, owing to the predictive power of the denoising model. Currently, diffusion models employ a single uniform denoising model across all timesteps. However, the inherent variations in data distributions at different timesteps lead to conflicts during training, constraining the potential of diffusion models. To address this challenge, we propose a novel two-stage divide-and-conquer training strategy termed TDC Training. It groups timesteps based on task similarity and difficulty and assigns a highly customized denoising model to each group, thereby enhancing the performance of diffusion models. The two-stage training avoids the need to train each model separately, and the total training cost is even lower than that of training a single unified denoising model. Additionally, we introduce Proxy-based Pruning to further customize the denoising models. This method transforms the pruning problem of diffusion models into a multi-round decision-making problem, enabling precise pruning of diffusion models. Our experiments validate the effectiveness of TDC Training, demonstrating an FID improvement of 1.5 on ImageNet64 compared to the original IDDPM while saving about 20% of computational resources.

Paper Structure (12 sections, 12 equations, 8 figures, 6 tables, 1 algorithm)

Introduction
Method
Unequal Timesteps in Denoising Capacity
Progressive FLOPs Allocation with Grouped Steps
TDC Training for Progressive Diffusion Models
Experiments
Experiments on TDC Training
Comparative Experiments of Pruning Methods
Comparative Analysis: Single-Stage vs. Two-Stage Training Strategies
Stability of Proxy Pruning
Ablation Study of FLOPs Constraint $k$
Conclusion

Figures (8)

Figure 1: Visualization of Diffusion Model Performance: Circle sizes represent computational costs (GFLOPs) while vertical positioning indicates FID scores.
Figure 2: Pipeline of Our TDC Training Strategy. First, SNR for each timestep is calculated to estimate the difficulty of the denoising task. Timesteps are then grouped based on task difficulty, and model capacity is allocated accordingly. During training, a base model covering all timesteps is trained in the first phase. In the second phase, for each group, Proxy-based Pruning is applied to the base model according to the allocated model capacity, and then fine-tuning is performed on the timesteps within each group to obtain specialized models for each group.
Figure 3: Comparison of FID and Training Steps Across Different Training Strategies
Figure 4: Sample images of LDM on FFHQ with (top) and without (bottom) our TDC Training (100 sampling steps).
Figure 5: Mean-std Curve over Pruning Rounds.
...and 3 more figures
