Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection

Alireza Ganjdanesh; Yan Kang; Yuchen Liu; Richard Zhang; Zhe Lin; Heng Huang

TL;DR

This work proposes to reduce the sampling cost by pruning a pretrained diffusion model into a mixture of efficient experts, and introduces the Expert Routing Agent, which learns to select a set of proper network configurations to optimize the resource usage between experts.

Abstract

Diffusion probabilistic models can generate high-quality samples. Yet, their sampling process requires numerous denoising steps, making it slow and computationally intensive. We propose to reduce the sampling cost by pruning a pretrained diffusion model into a mixture of efficient experts. First, we study the similarities between pairs of denoising timesteps, observing a natural clustering, even across different datasets. This suggests that rather than having a single model for all timesteps, separate models can serve as "experts" for their respective time intervals. As such, we separately fine-tune the pretrained model on each interval, with elastic dimensions in depth and width, to obtain experts specialized in their corresponding denoising interval. To optimize the resource usage between experts, we introduce our Expert Routing Agent, which learns to select a set of proper network configurations. By doing so, our method can allocate the computing budget between the experts in an end-to-end manner without requiring manual heuristics. Finally, with a selected configuration, we fine-tune our pruned experts to obtain our mixture of efficient experts. We demonstrate the effectiveness of our method, DiffPruning, across several datasets (LSUN-Church, LSUN-Beds, FFHQ, and ImageNet) on the Latent Diffusion Model architecture.

Paper Structure (31 sections, 19 equations, 12 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Method
Background
Notations
Clustering Denoising Timesteps into Intervals
Fine-tuning with Elastic Dimensions
Expert Routing Agent
Pruning width.
Pruning depth.
Pruning the Mixture of Experts
Fine-tuning pruned models.
Experiments
Comparison Results
Ablation Study
...and 16 more sections

Figures (12)

Figure 1: Overview of DiffPruning. We prune a pre-trained LDM rombach2022LDM (top) into a mixture of efficient experts (bottom). Each expert handles one interval of denoising timesteps, which allows the experts' architectures to be specialized separately by removing layers or channels.
Figure 1: Comparison results of our method vs. the baselines SP fang2023StructuralPruningforDMs, OMS-DPM liu2023OMS-DPM, DDPM ho2020ddpm, and LDM rombach2022LDM. First row: FID vs. MACs curves. Second row: FID vs. throughput curves. We measure throughput on an NVIDIA A100 GPU. Higher throughput and lower FID and MACs indicate better performance.
Figure 2: Our Pruning Scheme: We train our Expert Routing Agent (ERA) to prune the experts into a mixture of efficient experts (Sec. \ref{ERA}). The ERA predicts the architecture vectors $(v, u)$ that prune the experts' width and depth. Then, we calculate the denoising objectives of the selected sub-networks of the experts, $\mathcal{L}_{\text{DDPM},\mathcal{I}_i}$, as well as our resource regularization term, $\mathcal{R}$, which encourages the ERA to provide a mixture of efficient experts with a desired compute budget (MACs). We train the ERA's parameters to minimize these objectives. Thus, it learns to automatically allocate the compute budget (MACs) between experts in an end-to-end manner.
Figure 2: Weighted average $\mathcal{J}(t_1)$ (Eq. \ref{obj_cutoff}) of the mean of alignment scores in two clusters for the LDM trained on FFHQ.
Figure 3: Our Interval Selection Scheme: We calculate the gradients of the denoising timesteps' objectives w.r.t. the pre-trained LDM's parameters and take the cosine similarity of two timesteps' gradients as their alignment score. The dashed lines show our selected cluster intervals for the experts. One can observe that the optimal cluster assignments differ across datasets, so employing a deterministic clustering strategy balaji2022ediffiMOE such as uniform clustering feng2023ernieMOE for all datasets is sub-optimal.
...and 7 more figures
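
The interval selection described in the Figure 3 caption, together with the cutoff score $\mathcal{J}(t_1)$ mentioned in the second Figure 2 caption, can be illustrated with a small sketch. This is not the authors' implementation. It assumes that per-timestep gradients are already available as flat vectors, that the two clusters are the contiguous ranges before and after a cutoff `t1`, and that each cluster's mean alignment score is weighted by its share of the timesteps. The exact weighting in Eq. obj_cutoff is not given in this excerpt, and all function names here are hypothetical.

```python
import math


def cosine_similarity(g1, g2):
    """Alignment score of two timesteps: cosine similarity of their gradients."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("gradient vectors must be non-zero")
    return dot / (norm1 * norm2)


def alignment_matrix(grads):
    """Pairwise alignment scores for a list of per-timestep gradient vectors."""
    n = len(grads)
    return [[cosine_similarity(grads[i], grads[j]) for j in range(n)]
            for i in range(n)]


def mean_block(scores, lo, hi):
    """Mean alignment score over all timestep pairs in [lo, hi).

    Self-pairs (score 1) are included for simplicity; this is an assumption.
    """
    n = hi - lo
    total = sum(scores[i][j] for i in range(lo, hi) for j in range(lo, hi))
    return total / (n * n)


def cutoff_objective(scores, t1):
    """Assumed form of J(t1): size-weighted average of the two cluster means."""
    n = len(scores)
    left_weight = t1 / n
    right_weight = (n - t1) / n
    return (left_weight * mean_block(scores, 0, t1)
            + right_weight * mean_block(scores, t1, n))


def best_cutoff(scores):
    """Cutoff that maximizes the assumed J(t1) over all two-way splits."""
    return max(range(1, len(scores)), key=lambda t1: cutoff_objective(scores, t1))


# Toy gradients (fabricated for illustration only): the first three timesteps
# share one direction and the last five share an orthogonal direction.
toy_grads = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 5
scores = alignment_matrix(toy_grads)
selected_t1 = best_cutoff(scores)
```

On this toy input the selected cutoff separates the two groups of timesteps. In practice the gradients would come from the denoising objective of the pre-trained model at each timestep, and the same search could be applied recursively or extended to more than two intervals; both are design choices this sketch leaves open.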
