Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models

Alireza Ganjdanesh, Reza Shirkavand, Shangqian Gao, Heng Huang

TL;DR

This work introduces Adaptive Prompt-Tailored Pruning (APTP), a prompt-based pruning framework for text-to-image diffusion models that assigns each input prompt to a specialized sub-network (expert) under a total compute budget. A prompt router, comprising a frozen prompt encoder, an architecture predictor, and an OT-based router, maps prompts to architecture codes, enabling batch-parallel inference while diversifying capacity allocation via optimal transport. The method trains the router and codes with a composite objective including DDPM denoising, distillation, resource regularization, and a contrastive loss to cluster semantically similar prompts into nearby codes, with Gumbel-Sigmoid masks enabling differentiable pruning of depth and width. Experiments pruning Stable Diffusion 2.1 on CC3M and COCO demonstrate that APTP outperforms static pruning baselines in FID, CLIP, and CMMD while reducing latency, and analysis shows the router discovers semantically meaningful prompt groups and challenging prompts (e.g., text images) that are routed to higher-capacity experts. Overall, APTP offers a practical, plug-in approach to adapt pretrained T2I models to target data and compute budgets, enabling scalable, batch-efficient deployment. The work highlights the importance of prompt-aware capacity control and provides insights into prompt clustering and expert specialization for diffusion-based generation.
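The Gumbel-Sigmoid masks mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the temperature value, and the example logits are all hypothetical, and the sketch only shows how a relaxed (soft) keep/prune decision and its thresholded (hard) version might be sampled from learnable logits.

```python
import math
import random


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gumbel_sigmoid(logit, temperature=1.0, rng=random):
    """Relaxed Bernoulli sample in [0, 1] for one prunable unit."""
    # The difference of two Gumbel(0, 1) samples is Logistic(0, 1) noise,
    # which can be drawn directly as log(u) - log(1 - u).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    noise = math.log(u) - math.log(1.0 - u)
    return _sigmoid((logit + noise) / temperature)


def sample_mask(logits, temperature=1.0, rng=random):
    """Return (soft, hard) keep-masks for a list of layer/channel logits."""
    soft = [gumbel_sigmoid(l, temperature, rng) for l in logits]
    hard = [1 if s > 0.5 else 0 for s in soft]
    return soft, hard


# Hypothetical logits: strongly keep, strongly prune, undecided.
soft, hard = sample_mask([8.0, -8.0, 0.0], temperature=0.5, rng=random.Random(0))
```

In an autodiff framework, the soft values would carry gradients back to the logits while the hard values select which layers or channels are actually kept; this plain-Python sketch shows only the sampling step.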

Abstract

Text-to-image (T2I) diffusion models have demonstrated impressive image generation capabilities. Still, their computational intensity prevents resource-constrained organizations from deploying T2I models after fine-tuning them on their internal target data. While pruning techniques offer a potential solution to reduce the computational burden of T2I models, static pruning methods use the same pruned model for all input prompts, overlooking the varying capacity requirements of different prompts. Dynamic pruning addresses this issue by utilizing a separate sub-network for each prompt, but it prevents batch parallelism on GPUs. To overcome these limitations, we introduce Adaptive Prompt-Tailored Pruning (APTP), a novel prompt-based pruning method designed for T2I diffusion models. Central to our approach is a prompt router model, which learns to determine the required capacity for an input text prompt and routes it to an architecture code, given a total desired compute budget for prompts. Each architecture code represents a specialized model tailored to the prompts assigned to it, and the number of codes is a hyperparameter. We train the prompt router and architecture codes using contrastive learning, ensuring that similar prompts are mapped to nearby codes. Further, we employ optimal transport to prevent the codes from collapsing into a single one. We demonstrate APTP's effectiveness by pruning Stable Diffusion (SD) V2.1 using CC3M and COCO as target datasets. APTP outperforms the single-model pruning baselines in terms of FID, CLIP, and CMMD scores. Our analysis of the clusters learned by APTP reveals they are semantically meaningful. We also show that APTP can automatically discover prompts that were previously found empirically to be challenging for SD, e.g., prompts for generating text images, and assigns them to higher-capacity codes.
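To illustrate how optimal transport can keep the codes from collapsing into a single one, here is a minimal sketch using Sinkhorn-Knopp iterations, a standard entropic optimal-transport solver. Whether APTP uses exactly this solver is an assumption; the function names, the `epsilon` value, and the example scores are hypothetical.

```python
import math


def sinkhorn_assign(scores, epsilon=0.1, n_iters=100):
    """Balanced soft assignment of n prompts to k codes.

    scores[i][j] is the similarity of prompt i to code j.
    Returns a plan whose rows sum to 1 and whose columns each sum to ~n/k.
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract the max for stability
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every code receives the same total mass, 1/k.
        col = [max(sum(q[i][j] for i in range(n)), 1e-300) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row step: every prompt carries the same mass, 1/n.
        q = [[v / (max(sum(row), 1e-300) * n) for v in row] for row in q]
    return [[v * n for v in row] for row in q]


def hard_assign(plan):
    """Pick the code with the largest share of each prompt's mass."""
    return [max(range(len(row)), key=row.__getitem__) for row in plan]


# Hypothetical batch in which every prompt prefers code 0.
scores = [[1.0, 0.0], [0.9, 0.1], [0.7, 0.2], [0.6, 0.3]]
naive = hard_assign(scores)      # plain argmax sends every prompt to code 0
plan = sinkhorn_assign(scores)   # balanced transport plan
balanced = hard_assign(plan)     # prompts are now spread across both codes
```

The equal-mass column constraint is what forces prompts onto under-used codes: a prompt that only weakly prefers the popular code is moved to another code, while strongly matching prompts stay where they are.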

Paper Structure

This paper contains 36 sections, 31 equations, 14 figures, and 9 tables.

Figures (14)

  • Figure 1: Overview: We prune a text-to-image diffusion model like Stable Diffusion (left) into a mixture of efficient experts (right) in a prompt-based manner. Our prompt router routes distinct types of prompts to different experts, allowing experts' architectures to be separately specialized by removing layers or channels.
  • Figure 2: Our pruning scheme. We train our prompt router and the set of architecture codes to prune a text-to-image diffusion model into a mixture of experts. The prompt router consists of three modules. We use a Sentence Transformer (Reimers & Gurevych, 2019) as our prompt encoder to encode the input prompt into a representation $z$. Then, the architecture predictor transforms $z$ into the architecture embedding $e$, which has the same dimensionality as the architecture codes. Finally, the router routes the embedding $e$ to an architecture code $a^{(i)}$. We use optimal transport to evenly assign the prompts in a training batch to the architecture codes. The architecture code $a^{(i)}=(u^{(i)}, v^{(i)})$ determines how the model's width and depth are pruned. We train the prompt router's parameters and the architecture codes end-to-end using the denoising objective of the pruned model $\mathcal{L}_{\text{DDPM}}$, the distillation loss between the pruned and original models $\mathcal{L}_{\text{distill}}$, the average resource usage for the samples in the batch $\mathcal{R}$, and a contrastive objective $\mathcal{L}_{\text{cont}}$ that encourages the embeddings $e$ to preserve the semantic similarity of the representations $z$.
  • Figure 3: Samples from the APTP-Base experts after pruning Stable Diffusion V2.1 using CC3M (Sharma et al., 2018) and COCO (Lin et al., 2014) as the target datasets. Expert IDs are shown at the top right of the images. (See Table \ref{supp:grid-prompts} for prompts.)
  • Figure 4: Comparison of samples generated by low- and high-budget experts of APTP-Base vs. SD V2.1 on the CC3M and MS-COCO validation sets.
  • Figure 5: Ablation results for the number of experts in APTP on MS-COCO.
  • ...and 9 more figures
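
The routing step described in the Figure 2 caption, and the batch parallelism it enables, can be sketched as follows. This is an illustrative assumption rather than the paper's code: the embeddings stand in for the output of the prompt encoder and architecture predictor, routing by highest dot product is assumed, and all names and values are hypothetical.

```python
from collections import defaultdict


def route(embedding, codes):
    """Return the index of the architecture code most similar to the embedding."""
    scores = [sum(e * c for e, c in zip(embedding, code)) for code in codes]
    return max(range(len(codes)), key=scores.__getitem__)


def group_by_expert(prompts, embeddings, codes):
    """Group prompts by routed expert so each expert can run its group as one batch."""
    batches = defaultdict(list)
    for prompt, emb in zip(prompts, embeddings):
        batches[route(emb, codes)].append(prompt)
    return dict(batches)


# Hypothetical 2-D architecture codes and prompt embeddings.
codes = [[1.0, 0.0], [0.0, 1.0]]
prompts = ["a dog on a beach", "a sign that reads OPEN", "a cat on a sofa"]
embeddings = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
batches = group_by_expert(prompts, embeddings, codes)
```

Because the number of experts is fixed and small, prompts that share an expert can be processed together in one batch, in contrast to dynamic pruning with a separate sub-network per prompt.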