Prompt-aware classifier free guidance for diffusion models

Xuanhao Zhang, Chang Li

TL;DR

The paper addresses the limitation of using a single fixed classifier-free guidance (CFG) scale in diffusion models across prompts and modalities. It proposes a prompt-aware framework that constructs a synthetic multi-scale dataset and trains a lightweight predictor to estimate per-scale quality from semantic embeddings and linguistic complexity, enabling inference-time scale selection without retraining the diffusion model. The method chooses the scale that maximizes a regularized utility objective and applies CFG with that scale, improving fidelity, alignment, and perceptual quality on image (MSCOCO 2014 with SDXL) and audio (AudioCaps with AudioLDM2) generation. Experiments show consistent gains over vanilla CFG and No Guidance, demonstrating a practical enhancement for pretrained diffusion backbones that leaves the backbone untouched.
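
For readers unfamiliar with how the guidance scale enters sampling, the following is a minimal sketch of the standard CFG combination of noise predictions. The function name `cfg_combine` and the use of plain Python lists are illustrative assumptions, and the convention shown (scale 1 returns the conditional prediction) is one of several in use; it is not taken from the paper's code.

```python
from typing import List, Sequence


def cfg_combine(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    scale: float,
) -> List[float]:
    """Blend unconditional and conditional noise predictions.

    Convention assumed here: eps = eps_uncond + scale * (eps_cond - eps_uncond),
    so scale = 0 gives the unconditional prediction and scale = 1 the
    conditional one. Larger scales push further toward the prompt.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy two-element "noise predictions" (hypothetical values).
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], scale=2.0)
```

A prompt-aware method changes only the value passed as `scale`; the combination rule itself stays the same.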

Abstract

Diffusion models have achieved remarkable progress in image and audio generation, largely due to Classifier-Free Guidance (CFG). However, the choice of guidance scale remains underexplored: a fixed scale often fails to generalize across prompts of varying complexity, leading to oversaturation or weak alignment. We address this gap by introducing a prompt-aware framework that predicts scale-dependent quality and selects the optimal guidance scale at inference. Specifically, we construct a large synthetic dataset by generating samples under multiple scales and scoring them with reliable evaluation metrics. A lightweight predictor, conditioned on semantic embeddings and linguistic complexity, estimates multi-metric quality curves and determines the best scale via a utility function with regularization. Experiments on MSCOCO 2014 and AudioCaps show consistent improvements over vanilla CFG, enhancing fidelity, alignment, and perceptual preference. This work demonstrates that prompt-aware scale selection provides an effective, training-free enhancement for pretrained diffusion backbones.
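
The scale-selection step described in the abstract can be illustrated with a small sketch. The abstract does not give the exact utility, so everything specific here is an assumption: the weighted sum of predicted metric scores, the quadratic penalty around a default scale, the names `select_scale`, `default_scale`, and `reg`, and the toy numbers.

```python
from typing import Dict, Sequence


def select_scale(
    scales: Sequence[float],
    predicted: Dict[str, Sequence[float]],
    weights: Dict[str, float],
    default_scale: float = 7.5,
    reg: float = 0.01,
) -> float:
    """Pick the candidate scale with the highest regularized utility.

    Assumed utility (hypothetical, not the paper's formula):
        U(w) = sum_m weights[m] * predicted[m][w] - reg * (w - default_scale)^2
    """
    if not scales:
        raise ValueError("at least one candidate scale is required")
    best_scale, best_utility = scales[0], float("-inf")
    for i, w in enumerate(scales):
        # Weighted combination of the predicted per-metric quality at this scale.
        quality = sum(weights[m] * predicted[m][i] for m in weights)
        # Penalize scales far from the default to avoid extreme choices.
        utility = quality - reg * (w - default_scale) ** 2
        if utility > best_utility:
            best_scale, best_utility = w, utility
    return best_scale


# Hypothetical predicted quality curves for one prompt.
scales = [1.0, 3.0, 5.0, 7.5, 10.0]
predicted = {
    "alignment": [0.2, 0.5, 0.8, 0.7, 0.6],
    "aesthetic": [0.3, 0.6, 0.7, 0.7, 0.5],
}
weights = {"alignment": 0.5, "aesthetic": 0.5}

unregularized = select_scale(scales, predicted, weights, reg=0.0)
regularized = select_scale(scales, predicted, weights, reg=0.01)
```

In this toy example the unregularized choice follows the peak of the predicted curves, while the penalty pulls the choice back toward the default scale when the predicted gain is small.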

Paper Structure

This paper contains 4 sections, 10 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: A white car.
  • Figure 3: Overview of the proposed framework. Stage 1 constructs a guidance-based dataset by generating multi-scale samples from both text-to-image (SDXL) and text-to-audio (AudioLDM2) diffusion models. Each sample is evaluated with modality-specific metrics such as CLIP, ImageReward, or AudioBox-Aesthetics. Stage 2 trains a lightweight predictor that integrates semantic embeddings (CLIP/CLAP) with statistical complexity features (e.g., length, entropy). Stage 3 leverages the predictor at inference to select prompt-dependent guidance scales, which are then used in the diffusion reverse process for enhanced sampling.
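
The statistical complexity features mentioned in the caption for Stage 2 (length, entropy) could be computed along the following lines. This is a sketch under stated assumptions: whitespace tokenization, lowercasing, and Shannon entropy over the token distribution are illustrative choices, and `complexity_features` is a hypothetical helper rather than the authors' feature extractor.

```python
import math
from collections import Counter
from typing import Dict, Union


def complexity_features(prompt: str) -> Dict[str, Union[int, float]]:
    """Return simple complexity statistics for a text prompt.

    length  : number of whitespace-separated tokens
    entropy : Shannon entropy (in bits) of the token frequency distribution
    """
    tokens = prompt.lower().split()
    n = len(tokens)
    if n == 0:
        return {"length": 0, "entropy": 0.0}
    counts = Counter(tokens)
    # Sum of -p * log2(p) over the empirical token probabilities.
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    return {"length": n, "entropy": entropy}


features = complexity_features("A white car")
```

In a pipeline like the one in the caption, such features would be concatenated with the semantic embedding (CLIP or CLAP) before being passed to the predictor; that wiring is not shown here.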