Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

Jian Ma, Qirong Peng, Xujie Zhu, Peixing Xie, Chen Chen, Haonan Lu

TL;DR

This work tackles the high computational cost of diffusion transformers by introducing PPCL, a pluggable pruning framework that first identifies contiguous redundant layer intervals using linear probes and CKA trajectory analysis, then applies non-sequential inter-layer distillation for depth-wise pruning and lightweight linear projectors for width-wise pruning. The approach enables substantial parameter reduction (down to about 30–50% of the original) with minimal performance loss (typically <3%) and provides practical speedups and memory savings, including plug-and-play variants that can be derived from a smaller base model. PPCL is validated across multiple Multi-Modal Diffusion Transformer models, outperforming several prior pruning methods in both objective metrics and subjective quality, while maintaining strong text–image alignment. The work advances deployable diffusion-based systems in resource-constrained environments and offers open-source code to foster reproducibility and further research.

Abstract

Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with first-order difference trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme that integrates depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer (MMDiT) models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation while achieving higher compression ratios, making it well suited to resource-constrained environments. The open-source code and checkpoints for PPCL are available at https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning.

Paper Structure

This paper contains 19 sections, 15 equations, 13 figures, 6 tables, 2 algorithms.

Figures (13)

  • Figure 1: Visual comparison of Qwen-Image and its progressively pruned variants. Columns 1 and 4 show the original 20B-parameter Qwen-Image; columns 2 and 5 show the 70% parameter variant; columns 3 and 6 show the 50% parameter variant. The pruned variants retain generation quality comparable to the original model in color rendering, fine-grained text details, and facial feature synthesis.
  • Figure 2: Performance of Qwen-Image on LongText-Bench under three layer removal strategies: individual layers, contiguous layers, and non-contiguous layers. The x-axis denotes the index of the removed layer(s), and the y-axis indicates the accuracy. The pale-yellow zone in the lower-left indicates the mean accuracy for each of the three removal strategies.
  • Figure 3: (a) Depth-wise pruning: Stage 1.1 trains a linear probe for each MMDiT block. Stage 1.2 simulates pruning during training to assess the continuity between adjacent MMDiT blocks by tracking the first-order difference of the CKA between each block's output and the output of its corresponding linear probe. A decreasing first-order difference indicates a contiguous layer, while a sudden increase suggests a break. The length represents the value of the first-order difference. Stage 1.3 conducts feature distillation, with the inputs to the student model taken from the same contiguous layer unit. (b) Width-wise pruning: both stream-level and FFN redundancy in MMDiT are pruned.
  • Figure 4: Subjective comparison of complex text rendering in Qwen-Image when randomly removing contiguous and non-contiguous blocks. Columns 1 and 3 show the results for contiguous layer removal, while columns 2 and 4 correspond to non-contiguous layer removal.
  • Figure 5: CKA heatmaps for MMDiT's text and image streams. The text stream shows high cross-layer similarity, indicating substantial redundancy; the image stream exhibits a smooth decay of similarity away from the diagonal, reflecting sequential feature evolution with minimal redundancy.
  • ...and 8 more figures
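
The interval-detection idea in the Figure 3 caption (CKA between each block's output and its linear probe's output, followed by a first-order difference) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses the standard linear CKA formula, and the function names, the `jump_threshold` parameter, and the rule "start a new interval when the first-order difference rises by more than the threshold" are assumptions made for illustration only. The per-layer CKA values in the usage example are invented toy numbers, not results from the paper.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]  # rows = samples (tokens), columns = feature dims


def center_columns(x: Matrix) -> Matrix:
    """Subtract the per-column mean so every feature has zero mean."""
    n = len(x)
    means = [sum(col) / n for col in zip(*x)]
    return [[v - m for v, m in zip(row, means)] for row in x]


def cross_sq_norm(x: Matrix, y: Matrix) -> float:
    """Squared Frobenius norm of X^T Y."""
    total = 0.0
    for col_x in zip(*x):
        for col_y in zip(*y):
            dot = sum(a * b for a, b in zip(col_x, col_y))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data.

    Assumes neither input is constant across samples (non-zero denominator).
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_sq_norm(xc, yc)
    denominator = (cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc)) ** 0.5
    return numerator / denominator


def contiguous_intervals(
    cka_per_layer: Sequence[float], jump_threshold: float
) -> List[Tuple[int, int]]:
    """Group layer indices into inclusive (start, end) intervals.

    Hypothetical rule: d[i] = cka[i+1] - cka[i]; when d[i] exceeds d[i-1]
    by more than jump_threshold, layer i+1 opens a new interval.
    """
    diffs = [b - a for a, b in zip(cka_per_layer, cka_per_layer[1:])]
    intervals: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(diffs)):
        if diffs[i] - diffs[i - 1] > jump_threshold:
            intervals.append((start, i))
            start = i + 1
    intervals.append((start, len(cka_per_layer) - 1))
    return intervals


# Toy usage: the CKA trend falls steadily, jumps at layer 3, then falls again.
toy_cka = [0.90, 0.85, 0.78, 0.95, 0.90, 0.84]
print(contiguous_intervals(toy_cka, jump_threshold=0.1))
```

In practice the inputs to `linear_cka` would be a block's hidden states and its probe's predictions over a batch of tokens, and a tensor library would replace the pure-Python loops; the dependency-free form here only keeps the sketch self-contained.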