ALTER: All-in-One Layer Pruning and Temporal Expert Routing for Efficient Diffusion Generation

Xiaomeng Yang, Lei Lu, Qihui Fan, Changdi Yang, Juyi Lin, Yanzhi Wang, Xuan Zhang, Shangqian Gao

TL;DR

ALTER addresses the computational burden of diffusion generation by unifying layer-wise pruning and timestep-aware routing within a single-stage hypernetwork-driven framework. By transforming the UNet into a mixture of temporal experts and jointly optimizing pruning masks and routing, it achieves substantial efficiency (e.g., ~25.9% of original MACs with a 3.64x speedup at 20 steps) while preserving high visual fidelity. The approach leverages a dedicated Expert Generator and Temporal Router to allocate denoising timesteps to specialized pruned sub-networks, enabling full model utilization across the diffusion trajectory. Empirical results on SDv2.1 show strong performance against static pruning, sample-wise MoE, and cache-based baselines, highlighting practical benefits for real-time and resource-constrained diffusion deployment.
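The routing idea summarized above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the router is a fixed equal-width binning of timesteps (in ALTER the routing and masks come from a trained hypernetwork), the "layers" are scalar functions standing in for UNet blocks, and the names `make_uniform_router`, `run_expert`, and `layer_coverage` are hypothetical. It only shows the mechanics: each timestep selects an expert, an expert is a binary keep-mask over shared layers, and different experts can together cover every layer.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def make_uniform_router(num_train_timesteps: int, num_experts: int) -> Callable[[int], int]:
    """Hypothetical stand-in for a learned router: equal-width timestep bins."""
    def route(t: int) -> int:
        if not 0 <= t < num_train_timesteps:
            raise ValueError(f"timestep {t} outside [0, {num_train_timesteps})")
        return t * num_experts // num_train_timesteps
    return route


def run_expert(x: float, layers: Sequence[Layer], mask: Sequence[int]) -> float:
    """Apply the shared layers in order, skipping those the expert's mask prunes."""
    if len(layers) != len(mask):
        raise ValueError("mask must have one entry per layer")
    for layer, keep in zip(layers, mask):
        if keep:
            x = layer(x)
    return x


def layer_coverage(masks: Sequence[Sequence[int]]) -> List[int]:
    """1 where at least one expert keeps the layer, 0 where every expert prunes it."""
    return [int(any(column)) for column in zip(*masks)]


# Toy shared "layers" standing in for UNet blocks.
layers: List[Layer] = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: x - 3.0]
# One binary keep-mask per expert (assumed values, not learned).
expert_masks = [[1, 0, 1], [0, 1, 1]]
route = make_uniform_router(num_train_timesteps=1000, num_experts=len(expert_masks))

for t in (900, 100):
    expert = route(t)
    out = run_expert(2.0, layers, expert_masks[expert])
    print(f"t={t} -> expert {expert}, output {out}")

print("layers used by some expert:", layer_coverage(expert_masks))
```

In this toy setup each expert skips one layer, yet every layer is used by at least one expert. That is the sense in which timestep-wise routing can draw on the whole model across the trajectory, whereas a single static mask cannot.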

Abstract

Diffusion models have demonstrated exceptional capabilities in generating high-fidelity images. However, their iterative denoising process incurs significant computational overhead during inference, limiting practical deployment in resource-constrained environments. Existing acceleration methods often adopt uniform strategies that fail to capture temporal variations during diffusion generation, while the commonly adopted sequential pruning-then-fine-tuning strategy is sub-optimal because pruning decisions made on pretrained weights are misaligned with the model's final parameters. To address these limitations, we introduce ALTER: All-in-One Layer Pruning and Temporal Expert Routing, a unified framework that transforms diffusion models into a mixture of efficient temporal experts. ALTER unifies layer pruning, expert routing, and model fine-tuning in a single-stage optimization by employing a trainable hypernetwork, which dynamically generates layer pruning decisions and routes timesteps to specialized, pruned expert sub-networks while the UNet is being fine-tuned. This unified co-optimization strategy enables significant efficiency gains while preserving high generative quality. Specifically, ALTER matches the visual fidelity of the original 50-step Stable Diffusion v2.1 model while using only 25.9% of its total MACs with just 20 inference steps, delivering a 3.64x speedup at 35% sparsity.
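As a rough sanity check on the headline numbers, the following sketch relates the total-MAC ratio to the step count and the per-step cost. It assumes that total MACs are simply per-step MACs times the number of steps, and that 35% sparsity corresponds to roughly 65% of per-step MACs being retained. Both are our reading for illustration, not statements from the paper, and the function names are hypothetical.

```python
def relative_total_macs(per_step_ratio: float, steps: int, baseline_steps: int) -> float:
    """Total MACs relative to the baseline, assuming cost scales linearly with steps."""
    return per_step_ratio * steps / baseline_steps


def implied_per_step_ratio(total_ratio: float, steps: int, baseline_steps: int) -> float:
    """Per-step MAC ratio implied by a reported total-MAC ratio (same linear assumption)."""
    return total_ratio * baseline_steps / steps


# Reported: 25.9% of total MACs, 20 steps versus a 50-step baseline.
implied = implied_per_step_ratio(0.259, steps=20, baseline_steps=50)
# Assumed: 35% sparsity means about 65% of per-step MACs remain.
estimate = relative_total_macs(0.65, steps=20, baseline_steps=50)

print(f"implied per-step MAC ratio: {implied:.4f}")
print(f"estimated total MAC ratio at 0.65 per step: {estimate:.4f}")
```

Under these assumptions the reported 25.9% implies a per-step ratio of about 0.65, in line with the 35% sparsity figure. The 3.64x speedup is a separately reported measurement and is not derived from this arithmetic.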

Paper Structure

This paper contains 32 sections, 10 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of model utilization in dynamic pruning. Sample-wise pruning uses only a fixed subset of the model to generate a given image, whereas ALTER aims to utilize the full model capacity by activating layers according to the needs of each timestep.
  • Figure 2: Overview of the ALTER framework. ALTER is a temporal-adaptive-pruning framework for diffusion models, where a hypernetwork generates layer-wise pruning configurations for expert subnetworks and assigns each denoising timestep to a corresponding expert.
  • Figure 3: Qualitative comparison with the original SDv2.1 and BK-SDM-Small. SDv2.1 and BK-SDM-Small use 25-step PNDM sampling, while our method uses 20-step inference.
  • Figure 4: ALTER (0.65)'s temporal experts and router behavior. (a) Visualization of pruning patterns for $N_e=10$ experts. (b) Visualization of timestep-to-expert routing dynamics. (c) Ablation results for the number of experts on CC3M and MS-COCO 2014.
  • Figure 5: Training dynamics for different ratios $p$ and $\lambda_{\text{ratio}}$ weights.
  • ...and 1 more figure