MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics

Bowei Guo, Shengkun Tang, Cong Zeng, Zhiqiang Shen

TL;DR

MosaicDiff is a framework that aligns diffusion pretraining dynamics with post-training sampling acceleration via trajectory-aware structural pruning, so that the accelerated sampling process reflects the model's inner training dynamics.

Abstract

Diffusion models are renowned for their generative capabilities, yet their pretraining exhibits distinct phases of learning speed that prior post-training acceleration work has entirely overlooked. In this study, we introduce a novel framework called MosaicDiff that aligns diffusion pretraining dynamics with post-training sampling acceleration via trajectory-aware structural pruning. Our approach leverages the observation that the middle, fast-learning stage of diffusion pretraining requires more conservative pruning to preserve critical model features, while the early and later, slow-learning stages benefit from a more aggressive pruning strategy. This adaptive pruning mechanism is the first to explicitly mirror the inherent learning speed variations of diffusion pretraining, thereby harmonizing the model's inner training dynamics with its accelerated sampling process. Extensive experiments on DiT and SDXL demonstrate that our method achieves significant sampling speed-ups without compromising output quality, outperforming previous state-of-the-art methods by large margins. It also provides a new viewpoint for more efficient and robust training-free diffusion acceleration.
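
The stage-dependent pruning idea can be sketched as a lookup from sampling timestep to the sparsity of the subnetwork used at that step. The sketch below is illustrative only: the `Stage` class, the stage names, and the sparsity values are hypothetical, and the boundaries at 200 and 600 assume a 1000-step schedule split as in the Figure 3 caption. None of it is taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    t_low: int       # inclusive lower timestep bound
    t_high: int      # exclusive upper timestep bound
    sparsity: float  # fraction of structures pruned (hypothetical value)


# Hypothetical schedule: slow-learning stages are pruned more aggressively,
# the fast-learning middle stage more conservatively.
STAGES = (
    Stage("final", 0, 200, 0.6),
    Stage("middle", 200, 600, 0.3),
    Stage("early", 600, 1000, 0.6),
)


def stage_for_timestep(t: int) -> Stage:
    """Return the stage whose timestep range contains t."""
    for stage in STAGES:
        if stage.t_low <= t < stage.t_high:
            return stage
    raise ValueError(f"timestep {t} is outside [0, 1000)")


def mean_sparsity(timesteps) -> float:
    """Average sparsity over the timesteps actually visited by the sampler."""
    ts = list(timesteps)
    return sum(stage_for_timestep(t).sparsity for t in ts) / len(ts)
```

During sampling, a wrapper would call `stage_for_timestep(t)` at each step and dispatch to the subnetwork pruned for that stage; `mean_sparsity` gives a rough sense of the overall compute saved for a given set of sampling timesteps.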

Paper Structure

This paper contains 21 sections, 1 theorem, 23 equations, 11 figures, 13 tables.

Key Result

Theorem 1

With $\bar{\alpha}_t$ from the noise scheduler, the expectations of the MSE and the gradient can be formulated as:

Figures (11)

  • Figure 1: MosaicDiff is a post-training, training-free structural pruning technique for both transformer-based and U-Net-based diffusion models. It can achieve 0.5 pruning sparsity on the linear-scheduled 675M DiT-XL/2 and 0.3 pruning sparsity on the scaled-linear-scheduled 2.6B SDXL-base-1.0 with minimal performance degradation.
  • Figure 2: Overview of MosaicDiff. (a) Main framework: We first divide the inference process into three distinct stages according to a quantitative analysis of pretraining dynamics. For each stage, we utilize SNR-aware calibration data to perform second-order structural pruning, obtaining subnetworks with varying degrees of sparsity. Finally, we integrate these subnetworks to enable efficient inference across all timesteps. (b) Second-order structural pruning: To practically implement pruning on diffusion models, we feed SNR-aware calibration data into the pretrained model, computing Hessian matrices for each Attention and MLP layer. We then derive saliency scores from these Hessians to prune less important weight columns, corresponding to heads in multi-head self-attention (MHSA) layers and neurons in intermediate MLP layers.
  • Figure 3: Change in image MSE over sampling steps. In the early stage $T \in (600,1000)$, the MSE decreases slowly and images remain largely noisy; in the middle stage $T \in (200,600)$, denoising accelerates and images converge rapidly; in the final stage $T \in (0,200)$, the MSE reduction slows, indicating only subtle perceptual refinements.
  • Figure 4: Comparison of MSE and gradient curves under the linear schedule. Left: the MSE calculated from our closed-form approximation closely matches the sampled results. Right: gradients derived from our closed-form expression align with empirically sampled gradients.
  • Figure 5: Influence of SNR on final scores. (a) Change in SNR across sampling steps, showing a sharp increase during the final steps. (b) Final scores computed by incorporating SNR. A threshold of $M=0.55$ clearly divides the curve into three stages.
  • ...and 6 more figures
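
The second-order structural pruning step described in the Figure 2 caption can be illustrated with a generic OBS-style column score for a single linear layer. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `damping` parameter, and the choice of $H = X^\top X$ as the layer-wise Hessian proxy are all assumptions, and the sketch omits any compensating update to the remaining weights.

```python
import numpy as np


def column_saliency(weight, calib_inputs, damping=1e-2):
    """OBS-style saliency for each input column of a linear layer (assumed form).

    weight:       (d_out, d_in) weight matrix.
    calib_inputs: (n_samples, d_in) calibration activations entering the layer.
    """
    # Hessian proxy of the layer-wise squared reconstruction error: H = X^T X.
    hessian = calib_inputs.T @ calib_inputs
    # Damping keeps H invertible when the calibration data is rank-deficient.
    hessian = hessian + damping * np.mean(np.diag(hessian)) * np.eye(hessian.shape[0])
    h_inv_diag = np.diag(np.linalg.inv(hessian))
    # Removing column j costs roughly ||W[:, j]||^2 / [H^-1]_jj.
    return np.sum(weight ** 2, axis=0) / h_inv_diag


def prune_columns(weight, calib_inputs, sparsity):
    """Drop the lowest-saliency columns; return the pruned weight and keep-mask."""
    saliency = column_saliency(weight, calib_inputs)
    n_prune = int(round(sparsity * weight.shape[1]))
    pruned_idx = np.argsort(saliency)[:n_prune]
    mask = np.ones(weight.shape[1], dtype=bool)
    mask[pruned_idx] = False
    return weight[:, mask], mask
```

In a multi-head attention layer the columns would be grouped per head, and in an MLP per intermediate neuron, so that a whole structure is removed at once; the per-column score above would then be summed over each group.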

Theorems & Definitions (2)

  • Theorem 1
  • proof