Diff-ES: Stage-wise Structural Diffusion Pruning via Evolutionary Search

Zongfang Liu, Shengkun Tang, Zongliang Wu, Xin Yuan, Zhiqiang Shen

TL;DR

Diff-ES is a stage-wise structural pruning framework for diffusion models based on evolutionary search. It optimizes the stage-wise sparsity schedule and executes it through memory-efficient weight routing, without duplicating the model.

Abstract

Diffusion models have achieved remarkable success in high-fidelity image generation but remain computationally demanding due to their multi-step denoising process and large model sizes. Although prior work improves efficiency either by reducing sampling steps or by compressing model parameters, existing structured pruning approaches still struggle to balance real acceleration and image quality preservation. In particular, prior methods such as MosaicDiff rely on heuristic, manually tuned stage-wise sparsity schedules and stitch multiple independently pruned models during inference, which increases memory overhead. However, the importance of diffusion steps is highly non-uniform and model-dependent. As a result, schedules derived from simple heuristics or empirical observations often fail to generalize and may lead to suboptimal performance. To this end, we introduce Diff-ES, a stage-wise structural Diffusion pruning framework via Evolutionary Search, which optimizes the stage-wise sparsity schedule and executes it through memory-efficient weight routing without model duplication. Diff-ES divides the diffusion trajectory into multiple stages, automatically discovers an optimal stage-wise sparsity schedule via evolutionary search, and activates stage-conditioned weights dynamically without duplicating model parameters. Our framework naturally integrates with existing structured pruning methods for diffusion models, including depth and width pruning. Extensive experiments on DiT and SDXL demonstrate that Diff-ES consistently achieves wall-clock speedups while incurring minimal degradation in generation quality, establishing state-of-the-art performance for structured diffusion model pruning.

Paper Structure

This paper contains 43 sections, 11 equations, 5 figures, and 11 tables.

Figures (5)

  • Figure 1: Comparison of sparsity ($1 - \text{density}$) schedules. An effective stage-wise sparsity schedule is crucial for maintaining image quality. Using the same structural second-order pruning method, our Diff-ES framework significantly outperforms MosaicDiff on SDXL by employing an optimized, adaptive stage-wise sparsity schedule. For fair comparison, the original 3-stage schedule of MosaicDiff is visualized in 10.
  • Figure 2: Visual comparison on SDXL-base-1.0 under 30% sparsity. Each row shows generations from the same prompts using 20 sampling steps (CFG 7.5): Dense, OBS-Diff, MosaicDiff with OBS, and Diff-ES with OBS. At the same sparsity level, Diff-ES remains closest to the dense model, preserving object identity, scene layout, and fine textures. OBS-Diff already shows clear structural differences from the dense model (e.g., the teddy bear is rendered with three legs), while MosaicDiff exhibits much more severe semantic and perceptual degradation. This visual trend is consistent with the FID scores.
  • Figure 3: Overview of Diff-ES. We divide the diffusion process into stages and evolve an optimal stage-wise sparsity schedule under a fixed global budget. A level-switch mutation redistributes sparsity across stages, while lightweight fitness evaluation (e.g., TOPIQ/CLIP-IQA/SSIM) guides survivor selection. For each stage, SNR-aware calibration supports pruning methods to obtain calibrated projection orders. For methods with weight updating such as OBS, all updated weights are stored once in a compact database and retrieved during evaluation through our efficient weight-routing mechanism, enabling rapid model assembly without recomputing second-order updates.
  • Figure 4: Stage-wise sparsity schedules of Diff-ES and MosaicDiff across models. DiT: Diff-ES vs MosaicDiff (FID: 3.63 vs 4.01); SDXL: Diff-ES vs MosaicDiff (FID: 33.12 vs 98.56). On DiT, the sparsity schedules are similar between MosaicDiff and Diff-ES. On SDXL, the sparsity schedules diverge, indicating that MosaicDiff's empirical schedule overfits to DiT. Stages 0–9 correspond to timesteps 0–999; sampling starts at timestep 999, and earlier stages correspond to earlier starting timesteps.
  • Figure 5: Out-of-distribution visual examples on AI-generated prompts (disjoint from the MS-COCO search set) across sparsity levels under the same fitness metric (CLIP-IQA). Diff-ES preserves semantic consistency and fine details up to moderate sparsity, with more visible degradation mainly at the highest sparsity level.
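
The Figure 3 and Figure 4 captions describe a stage-wise schedule evolved under a fixed global budget, with a level-switch mutation that redistributes sparsity across stages. The following is a minimal Python sketch of that idea, not the authors' implementation. It rests on several assumptions that the captions do not confirm: stages have equal width (100 timesteps each), sparsity is encoded as an integer level per stage, the mutation moves exactly one level from one stage to another, and selection is a simple greedy keep-the-best rule. The names `stage_of`, `level_switch`, `evolve`, and `toy_fitness` are hypothetical.

```python
import random

NUM_STAGES = 10        # Figure 4 refers to stages 0-9
NUM_TIMESTEPS = 1000   # covering timesteps 0-999


def stage_of(timestep):
    """Map a timestep to its stage index (assumes equal-width stages)."""
    if not 0 <= timestep < NUM_TIMESTEPS:
        raise ValueError("timestep out of range")
    return timestep * NUM_STAGES // NUM_TIMESTEPS


def level_switch(schedule, max_level, rng):
    """Move one sparsity level between two stages.

    One stage drops a level and a different stage gains one, so the sum of
    levels (the assumed global budget) is unchanged.
    """
    child = list(schedule)
    donors = [i for i, lv in enumerate(child) if lv > 0]
    if not donors:
        return child
    src = rng.choice(donors)
    receivers = [i for i, lv in enumerate(child) if lv < max_level and i != src]
    if not receivers:
        return child
    dst = rng.choice(receivers)
    child[src] -= 1
    child[dst] += 1
    return child


def evolve(parent, fitness, generations, offspring, max_level, seed=0):
    """Greedy search: mutate the current best and keep any improving child."""
    rng = random.Random(seed)
    best, best_score = list(parent), fitness(parent)
    for _ in range(generations):
        children = [level_switch(best, max_level, rng) for _ in range(offspring)]
        for child in children:
            score = fitness(child)
            if score > best_score:
                best, best_score = child, score
    return best, best_score


def toy_fitness(schedule):
    """Stand-in score that rewards sparsity in higher-indexed stages.

    A real run would score generated images with a quality metric instead.
    """
    return sum(i * lv for i, lv in enumerate(schedule))


if __name__ == "__main__":
    start = [3] * NUM_STAGES  # uniform schedule as the starting point
    best, score = evolve(start, toy_fitness, generations=20, offspring=8, max_level=6)
    print(best, score, sum(best) == sum(start))
```

In this sketch the fitness function is a placeholder. The Figure 3 caption names TOPIQ, CLIP-IQA, and SSIM as examples of lightweight fitness metrics, so an actual search would presumably assemble the pruned model for each candidate schedule and score its generations with one of those.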