Prompt-aware classifier-free guidance for diffusion models
Xuanhao Zhang, Chang Li
TL;DR
A fixed classifier-free guidance (CFG) scale in diffusion models does not generalize across prompts and modalities. The paper proposes a prompt-aware framework that builds a synthetic multi-scale dataset and trains a lightweight predictor to estimate per-scale quality from semantic embeddings and linguistic complexity, enabling scale selection at inference time without retraining the diffusion backbone. The method selects the scale that maximizes a regularized utility objective and then runs CFG at that scale, improving fidelity, alignment, and perceptual quality on image generation (MSCOCO 2014 with SDXL) and audio generation (AudioCaps with AudioLDM2). Experiments show consistent gains over vanilla CFG and No Guidance, making this a practical enhancement that leaves the pretrained diffusion backbone untouched.
Abstract
Diffusion models have achieved remarkable progress in image and audio generation, largely due to Classifier-Free Guidance (CFG). However, the choice of guidance scale remains underexplored: a fixed scale often fails to generalize across prompts of varying complexity, leading to oversaturation or weak alignment. We address this gap by introducing a prompt-aware framework that predicts scale-dependent quality and selects the optimal guidance scale at inference. Specifically, we construct a large synthetic dataset by generating samples under multiple scales and scoring them with reliable evaluation metrics. A lightweight predictor, conditioned on semantic embeddings and linguistic complexity, estimates multi-metric quality curves and determines the best scale via a utility function with regularization. Experiments on MSCOCO 2014 and AudioCaps show consistent improvements over vanilla CFG, enhancing fidelity, alignment, and perceptual preference. This work demonstrates that prompt-aware scale selection provides an effective, training-free enhancement for pretrained diffusion backbones.
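The selection step described above can be illustrated with a minimal sketch. The CFG combination rule is the standard one (the unconditional noise prediction is extrapolated toward the conditional one by the guidance scale). Everything else here is an assumption made for illustration, not the paper's actual formulation: the function names, the weighted-sum utility, the quadratic penalty toward a default scale, and the numeric values are all hypothetical, and the predicted quality curves stand in for the output of the lightweight predictor.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor of `scale`."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def select_scale(scales, predicted_quality, metric_weights,
                 default_scale, reg_strength):
    """Pick the candidate scale with the highest regularized utility.

    ASSUMPTION: utility = weighted sum of predicted metrics minus a
    quadratic penalty for deviating from `default_scale`. The abstract
    only says "a utility function with regularization"; this particular
    form is a placeholder.
    """
    scales = np.asarray(scales, dtype=float)
    # Shape (num_scales, num_metrics): one predicted quality curve per metric.
    quality = np.asarray(predicted_quality, dtype=float)
    weights = np.asarray(metric_weights, dtype=float)
    utility = quality @ weights - reg_strength * (scales - default_scale) ** 2
    best = int(np.argmax(utility))
    return float(scales[best]), utility


# Hypothetical predictor output for one prompt: two metrics, five scales.
candidate_scales = [1.0, 3.0, 5.0, 7.0, 9.0]
predicted = [[0.2, 0.3], [0.5, 0.6], [0.7, 0.7], [0.9, 0.6], [0.6, 0.4]]

chosen, utilities = select_scale(candidate_scales, predicted,
                                 metric_weights=[0.5, 0.5],
                                 default_scale=5.0, reg_strength=0.02)

# The chosen scale would then be used at every denoising step.
eps = cfg_combine(np.zeros(4), np.ones(4), chosen)
```

In this toy setup the regularizer matters: without it the highest raw predicted quality wins, while a nonzero penalty pulls the choice back toward the default scale. Since only the per-prompt scale changes, the backbone's weights are left untouched.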
