Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training

Lexington Whalen, Zhenbang Du, Haoran You, Chaojian Li, Sixu Li, Yingyan Lin

TL;DR

This paper tackles the high computational cost of training diffusion models by leveraging Early-Bird (EB) tickets. It identifies both traditional EB tickets that emerge early across timesteps and diffusion-dedicated TA-EB tickets that tailor subnetworks to specific timestep regions, enabling aggressive sparsity where appropriate. The authors propose EB-Diff-Train, which trains region-specific TA-EB tickets in parallel and ensembles them at inference, achieving substantial speedups (roughly 2.9×–5.8× over dense training and up to 10.3× over train-prune-finetune) while preserving generation quality. The approach demonstrates strong empirical gains across multiple datasets (CIFAR-10, CelebA, LSUN, ImageNet-1K) and diffusion backbones (DDPM, LDM), and is compatible with existing timestep-resampling methods like SpeeD, indicating broad practical impact for efficient diffusion model training.

Abstract

Training diffusion models (DMs) requires substantial computational resources due to multiple forward and backward passes across numerous timesteps, motivating research into efficient training techniques. In this paper, we propose EB-Diff-Train, a new efficient DM training approach that is orthogonal to other methods of accelerating DM training, by investigating and leveraging Early-Bird (EB) tickets -- sparse subnetworks that manifest early in the training process and maintain high generation quality. We first investigate the existence of traditional EB tickets in DMs, enabling competitive generation quality without fully training a dense model. Then, we delve into the concept of diffusion-dedicated EB tickets, drawing on insights from varying importance of different timestep regions. These tickets adapt their sparsity levels according to the importance of corresponding timestep regions, allowing for aggressive sparsity during non-critical regions while conserving computational resources for crucial timestep regions. Building on this, we develop an efficient DM training technique that derives timestep-aware EB tickets, trains them in parallel, and combines them during inference for image generation. Extensive experiments validate the existence of both traditional and timestep-aware EB tickets, as well as the effectiveness of our proposed EB-Diff-Train method. This approach can significantly reduce training time both spatially and temporally -- achieving 2.9$\times$ to 5.8$\times$ speedups over training unpruned dense models, and up to 10.3$\times$ faster training compared to standard train-prune-finetune pipelines -- without compromising generative quality. Our code is available at https://github.com/GATECH-EIC/Early-Bird-Diffusion.
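The last step described above, combining timestep-aware tickets at inference, can be pictured as a dispatcher that sends each denoising step to the subnetwork responsible for its timestep region. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class name `TimestepRouter`, the boundary values 250 and 450, and the stand-in denoisers are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from typing import Callable, List, Sequence

# A denoiser maps (noisy sample, timestep) to a prediction.
# Real tickets would be pruned networks; plain callables stand in here.
Denoiser = Callable[[float, int], float]


class TimestepRouter:
    """Dispatch each denoising step to the ticket that owns its timestep region."""

    def __init__(self, boundaries: Sequence[int], tickets: Sequence[Denoiser]):
        # `boundaries` holds the first timestep of every region after the first,
        # so N boundaries define N + 1 regions.
        if len(tickets) != len(boundaries) + 1:
            raise ValueError("need exactly one ticket per region")
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be sorted in ascending order")
        self.boundaries: List[int] = list(boundaries)
        self.tickets: List[Denoiser] = list(tickets)

    def region(self, t: int) -> int:
        """Return the index of the region that contains timestep t."""
        return bisect_right(self.boundaries, t)

    def __call__(self, x: float, t: int) -> float:
        return self.tickets[self.region(t)](x, t)


# Hypothetical boundaries (assumption): three regions split at t=250 and t=450.
router = TimestepRouter(
    boundaries=[250, 450],
    tickets=[
        lambda x, t: x * 0.1,  # stand-in for the ticket of the first region
        lambda x, t: x * 0.2,  # stand-in for the ticket of the middle region
        lambda x, t: x * 0.3,  # stand-in for the ticket of the last region
    ],
)
```

Because each ticket only ever sees timesteps from its own region, the tickets can be trained independently, which is what makes the parallel training described in the abstract possible.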

Paper Structure

This paper contains 12 sections, 3 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: FID (lower values indicate higher generation quality) vs. relative training time (lower values indicate greater runtime efficiency) under different pruning rates ("$p$") for the CIFAR-10 [cifar10] dataset and the DDPM [ho2020denoising] model, comparing our methods against the standard train-prune-finetune method paired with random [frankle2018the], magnitude [hanMagnitudePrune], Taylor [Molchanov_2019_CVPR], and Diff-Pruning [diffPruning] pruning methods. "Scratch" indicates a model pruned then retrained from random initialization, "Unpruned" is the model without any pruning, "EB" (as detailed in Sec. 4.1) uses a single EB ticket across all timesteps, and "TA-EB" (as detailed in Sec. 4.2) employs three timestep-aware EB tickets for specific regions. Smaller circles indicate higher pruning rates; relative training time represents the ratio to the unpruned model's training time. Generation quality is measured by the FID score [fid_score].
  • Figure 2: Visualization of pairwise Hamming distance matrices for both the CIFAR-10 [cifar10] and CelebA [celeba] datasets, when using structural magnitude pruning at pruning rates of 30% and 50%. EB tickets (marked by red boxes) are consistently found during the early stages of training.
  • Figure 3: (a) The varying importance of timestep regions throughout the diffusion training trajectory [wang2024closer], which motivates our investigation into TA-EB tickets. (b) A comparison between vanilla training of DMs and our proposed TA-EB training, which offers the dual advantages of reducing model size through dedicated EB subnetworks and enhancing parallelism, thereby yielding savings both spatially and temporally.
  • Figure 4: Visualization of pairwise Hamming distance matrices for the CIFAR-10 [cifar10] dataset under structural magnitude pruning at pruning rates of 30%, 60%, and 80% across timestep regions 0-260, 240-460, and 440-1000, respectively. EB tickets are consistently observed during the early stages of each timestep region.
  • Figure 5: Qualitative comparison of generated image results for CelebA (left 3$\times$3 section) and CIFAR-10 (right 6$\times$6 section) using the DDPM [ho2020denoising] model. From left to right: generations from the unpruned model, our Early-Bird (EB) model with a 50% pruning rate, and our Timestep-Aware Early-Bird (TA-EB) model with a 64% average pruning rate.
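
The pairwise Hamming distance matrices in Figures 2 and 4 compare the pruning masks obtained at different points in training; an EB ticket shows up once those masks stop changing. The sketch below illustrates the idea in plain Python under stated assumptions: the per-channel importance scores, the helper names, and the stopping rule (a distance `threshold` checked over a `window` of previous epochs) are hypothetical and are not taken from the paper's code.

```python
from typing import List, Optional, Sequence


def magnitude_mask(scores: Sequence[float], prune_rate: float) -> List[int]:
    """Return a 0/1 mask that prunes the smallest-magnitude fraction of entries."""
    n_prune = int(len(scores) * prune_rate)
    # Indices ordered from smallest to largest absolute score.
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
    mask = [1] * len(scores)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def hamming(a: Sequence[int], b: Sequence[int]) -> float:
    """Normalized Hamming distance: the fraction of positions where two masks differ."""
    if len(a) != len(b):
        raise ValueError("masks must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_hamming(masks: Sequence[Sequence[int]]) -> List[List[float]]:
    """Distance matrix whose entry [i][j] compares the masks of epochs i and j."""
    return [[hamming(a, b) for b in masks] for a in masks]


def first_stable_epoch(
    masks: Sequence[Sequence[int]], threshold: float, window: int
) -> Optional[int]:
    """First epoch whose mask is within `threshold` of each of the previous
    `window` masks, or None if the masks never stabilize (assumed rule)."""
    for e in range(window, len(masks)):
        if all(hamming(masks[e], masks[e - k]) <= threshold
               for k in range(1, window + 1)):
            return e
    return None


# Hypothetical per-channel importance scores recorded at four successive epochs.
epoch_scores = [
    [0.9, 0.1, 0.5, 0.3],
    [0.2, 0.8, 0.6, 0.1],
    [0.1, 0.9, 0.7, 0.2],
    [0.05, 1.0, 0.8, 0.3],
]
masks = [magnitude_mask(s, prune_rate=0.5) for s in epoch_scores]
matrix = pairwise_hamming(masks)
ticket_epoch = first_stable_epoch(masks, threshold=0.1, window=2)
```

In this toy run the mask changes between the first two epochs and is identical afterwards, so the matrix has a block of zeros for the later epochs, which is the same pattern the figures highlight. For the timestep-aware variant, the same check would be run separately for each timestep region with that region's pruning rate.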