PersonalQ: Select, Quantize, and Serve Personalized Diffusion Models for Efficient Inference

Qirui Wang, Qi Guo, Yiding Sun, Junkai Yang, Dongxu Zhang, Shanmin Pang, Qing Guo

Abstract

Personalized text-to-image generation lets users fine-tune diffusion models into repositories of concept-specific checkpoints, but serving these repositories efficiently is difficult for two reasons: natural-language requests are often ambiguous and can be misrouted to visually similar checkpoints, and standard post-training quantization can distort the fragile representations that encode personalized concepts. We present PersonalQ, a unified framework that connects checkpoint selection and quantization through a shared signal -- the checkpoint's trigger token. Check-in performs intent-aligned selection by combining intent-aware hybrid retrieval with LLM-based reranking over checkpoint context and asks a brief clarification question only when multiple intents remain plausible; it then rewrites the prompt by inserting the selected checkpoint's canonical trigger. Complementing this, Trigger-Aware Quantization (TAQ) applies trigger-aware mixed precision in cross-attention, preserving trigger-conditioned key/value rows (and their attention weights) while aggressively quantizing the remaining pathways for memory-efficient inference. Experiments show that PersonalQ improves intent alignment over retrieval and reranking baselines, while TAQ consistently offers a stronger compression-quality trade-off than prior diffusion PTQ methods, enabling scalable serving of personalized checkpoints without sacrificing fidelity.
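The select-then-rewrite flow of Check-in can be illustrated with a small sketch. It is not the authors' implementation: a toy word-overlap score stands in for the hybrid retrieval and LLM-based reranking, and the `Checkpoint` fields, the `check_in` function, and the `margin` threshold are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    name: str         # hypothetical repository identifier
    trigger: str      # canonical trigger token, e.g. "<bear-v4>"
    concept: str      # plain word the trigger stands for, e.g. "bear"
    description: str  # free-text metadata used for matching


def overlap_score(prompt, ckpt):
    """Toy relevance score: fraction of description words found in the prompt.

    Assumption: this replaces the paper's hybrid retrieval + LLM reranking.
    """
    prompt_words = set(prompt.lower().split())
    desc_words = set(ckpt.description.lower().split())
    return len(prompt_words & desc_words) / max(len(desc_words), 1)


def check_in(prompt, repo, margin=0.1):
    """Pick a checkpoint, or ask for clarification if the top two are close."""
    ranked = sorted(repo, key=lambda c: overlap_score(prompt, c), reverse=True)
    best = ranked[0]
    if len(ranked) > 1:
        gap = overlap_score(prompt, best) - overlap_score(prompt, ranked[1])
        if gap < margin:
            # Several intents remain plausible: ask instead of guessing.
            return {"action": "clarify", "candidates": [best.name, ranked[1].name]}
    # Rewrite the prompt by swapping the concept word for the trigger token.
    rewritten = " ".join(
        best.trigger if word.lower() == best.concept else word
        for word in prompt.split()
    )
    return {"action": "generate", "checkpoint": best.name, "prompt": rewritten}


repo = [
    Checkpoint("bear-v4", "<bear-v4>", "bear", "plush teddy bear toy"),
    Checkpoint("cat-v2", "<cat-v2>", "cat", "orange tabby cat"),
]
result = check_in("a teddy bear on the beach", repo)
```

With this toy repository the request is routed to `bear-v4` and the word "bear" is replaced by `<bear-v4>`. If a second, similarly described bear checkpoint were added, the two scores would tie and the sketch would return a clarification request instead.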

Paper Structure (11 sections, 6 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Problem and overview. (a) Natural-language prompts must be routed to the correct personalized checkpoint; existing methods often misroute queries. (b) Standard post-training quantization degrades personalized concepts.
  • Figure 2: The trigger token (e.g., <sks>) is vulnerable under quantization. Following the token-specific sensitivity test in Sec. \ref{sec:TAQ}, we quantize only the cross-attention Key/Value rows of the trigger token while keeping all other rows in full precision. Under 4-bit quantization, the trigger token exhibits much larger degradation than ordinary words, motivating our TAQ method.
  • Figure 3: PersonalQ. Input: prompt $p$ and a personalized checkpoint repository. Output: image $\hat{\mathbf{y}}$ from low-bit inference. (a) Check-in selects a checkpoint $c^\star$ using prompt semantics and metadata, then rewrites $p$ into $p'$ by inserting the checkpoint’s trigger (e.g., bear$\rightarrow$<bear-v4>). (b) TAQ quantizes cross-attention with trigger-aware mixed precision: it keeps trigger-conditioned K/V rows (and their attention pathways) in higher precision and quantizes the remaining activations for efficient inference with preserved personalization.
  • Figure 4: Qualitative comparison. (a) SDXL-Turbo with W8A8 (8-bit weights / 8-bit activations). (b) Stable Diffusion v1.5 with W8A8. Each column corresponds to a different personalized checkpoint.
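
The row-level treatment described in the Figure 2 and Figure 3 captions can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's TAQ procedure: it uses symmetric per-row quantize-dequantize, covers only the Key/Value rows (not the attention weights), and the function names, the `low_bits`/`high_bits` parameters, and the example values are hypothetical.

```python
def fake_quantize_row(row, bits):
    """Symmetric uniform quantize-dequantize of one row with a per-row scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(x) for x in row)
    if peak == 0.0:
        return list(row)
    scale = peak / qmax
    # Round to the nearest integer level, clamp, then map back to floats.
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in row]


def trigger_aware_quantize(rows, trigger_positions, low_bits=4, high_bits=None):
    """Quantize K (or V) rows, treating trigger-token rows separately.

    rows: one list of floats per prompt token.
    trigger_positions: indices of the trigger token(s) in the prompt.
    high_bits=None leaves the trigger rows untouched (full precision).
    """
    out = []
    for i, row in enumerate(rows):
        if i in trigger_positions:
            kept = list(row) if high_bits is None else fake_quantize_row(row, high_bits)
            out.append(kept)
        else:
            out.append(fake_quantize_row(row, low_bits))
    return out


# Toy Key matrix: three prompt tokens, token 1 is the trigger (e.g., <sks>).
keys = [[0.9, -0.3, 0.1], [0.05, 0.7, -0.62], [0.4, 0.4, -0.8]]
mixed = trigger_aware_quantize(keys, trigger_positions={1}, low_bits=4)
```

Passing the non-trigger indices as `trigger_positions` inverts the roles, which mimics the sensitivity test in the Figure 2 caption: only the trigger row is quantized to 4 bits while every other row stays in full precision.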