Stage-wise Dynamics of Classifier-Free Guidance in Diffusion Models

Cheng Jin, Qitan Shi, Yuantao Gu

TL;DR

A unified three-stage view of CFG sampling explains a widely observed phenomenon: stronger guidance improves semantic alignment but inevitably reduces diversity. Early strong guidance erodes global diversity, while late strong guidance suppresses fine-grained variation.

Abstract

Classifier-Free Guidance (CFG) is widely used to improve conditional fidelity in diffusion models, but its impact on sampling dynamics remains poorly understood. Prior studies, often restricted to unimodal conditional distributions or simplified cases, provide only a partial picture. We analyze CFG under multimodal conditionals and show that the sampling process unfolds in three successive stages. In the Direction Shift stage, guidance accelerates movement toward the weighted mean, introducing initialization bias and norm growth. In the Mode Separation stage, local dynamics remain largely neutral, but the inherited bias suppresses weaker modes, reducing global diversity. In the Concentration stage, guidance amplifies within-mode contraction, diminishing fine-grained variability. This unified view explains a widely observed phenomenon: stronger guidance improves semantic alignment but inevitably reduces diversity. Experiments support these predictions, showing that early strong guidance erodes global diversity, while late strong guidance suppresses fine-grained variation. Moreover, our theory naturally suggests a time-varying guidance schedule, and empirical results confirm that it consistently improves both quality and diversity.
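To make the guidance weight and the idea of a time-varying schedule concrete, here is a minimal sketch. It assumes the standard CFG extrapolation, in which a weight of 1 recovers the purely conditional prediction, and a time variable $t \in [0, 1]$. The function names and the sinusoidal low-high-low shape are hypothetical choices for illustration; they are not the schedule proposed in the paper.

```python
import numpy as np


def cfg_combine(pred_uncond, pred_cond, omega):
    """Standard CFG extrapolation between two model predictions.

    omega = 1 returns the conditional prediction unchanged;
    omega > 1 pushes further along (pred_cond - pred_uncond).
    """
    return pred_uncond + omega * (pred_cond - pred_uncond)


def low_high_low_weight(t, omega_max, omega_min=1.0):
    """Hypothetical time-varying weight for t in [0, 1].

    Stays near omega_min at both ends of the trajectory and
    peaks at omega_max in the middle (t = 0.5). The sinusoidal
    shape is an assumption made only for this sketch.
    """
    t = np.asarray(t, dtype=float)
    return omega_min + (omega_max - omega_min) * np.sin(np.pi * t)


if __name__ == "__main__":
    # Toy predictions standing in for the outputs of a diffusion model.
    pred_uncond = np.zeros(4)
    pred_cond = np.ones(4)
    for t in np.linspace(1.0, 0.0, 5):  # from noise (t = 1) to data (t = 0)
        w = low_high_low_weight(t, omega_max=9.0)
        guided = cfg_combine(pred_uncond, pred_cond, w)
        print(f"t={t:.2f}  weight={float(w):.2f}  guided[0]={guided[0]:.2f}")
```

Because the weight is low at both ends, a schedule of this kind avoids strong guidance in the early stage, where the abstract reports a loss of global diversity, and in the late stage, where it reports a loss of fine-grained variation.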

Paper Structure

This paper contains 44 sections, 9 theorems, 82 equations, 11 figures, 3 tables.

Key Result

Theorem 3.2

Let $\bar{{\boldsymbol{\mu}}}=\sum_{k=1}^K \pi_k{\boldsymbol{\mu}}_k$ denote the class-weighted mean of the Gaussian mixture prior, and assume the same initialization ${\boldsymbol{x}}_1\sim\mathcal{N}(\mathbf 0,{\mathbf{I}})$ for both trajectories. Then for any $\omega>1$, there exists a time point where $\omega$ is the guidance weight, ${\boldsymbol{x}}_t^{(y)}$ is the solution to the conditiona
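The class-weighted mean $\bar{\boldsymbol{\mu}}=\sum_{k=1}^K \pi_k\boldsymbol{\mu}_k$ used in the theorem can be illustrated with a small numerical sketch. The two-component mixture below, including its weights and means, is invented for illustration and does not come from the paper.

```python
import numpy as np


def class_weighted_mean(weights, means):
    """Return sum_k pi_k * mu_k for mixture weights pi and component means mu."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("mixture weights must sum to 1")
    return weights @ means


# Hypothetical mixture: a dominant mode at (2, 0) and a weaker mode at (-2, 0).
weights = [0.7, 0.3]
means = [[2.0, 0.0], [-2.0, 0.0]]
mu_bar = class_weighted_mean(weights, means)
print(mu_bar)
```

In this toy case the weighted mean lies on the dominant mode's side of the origin, at 0.7 * 2 + 0.3 * (-2) = 0.8 along the first axis.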

Figures (11)

  • Figure 1: Illustration of the three-stage dynamics of conditional sampling (Cond-ODE, top row) versus Classifier-Free Guidance (CFG-ODE, bottom row) under a multimodal distribution. In the Direction Shift stage (left), CFG trajectories deviate more strongly toward the global weighted mean, introducing initialization bias. In the Mode Separation stage (middle), Cond-ODE trajectories maintain coverage of multiple modes, while CFG trajectories suppress weaker modes and collapse toward dominant ones. In the Concentration stage (right), CFG trajectories contract excessively within modes, leading to loss of fine-grained diversity. Red dots denote samples, gray arrows connect the start and end points of the same trajectory (indicating their correspondence), and blue crosses mark the weighted mean of conditional modes.
  • Figure 2: Comparison of guidance schedules on the prompt "A view of a bathroom that is clean." The (a) Constant schedule and (c) Early-high schedule both collapse diversity, with most samples converging to layouts dominated by large windows and uniform cool tones. The (b) Early-low schedule mitigates this effect, producing more varied spatial structures and color palettes.
  • Figure 3: Prompt: Futuristic city at sunset, glass towers with neon skybridges, flying cars leaving light trails, woman in silver robe holding holographic tablet on balcony, cinematic lighting, ultra-detailed, 8K render. Under the late-high schedule (a), the highlighted regions reveal cars that are nearly identical across samples, indicating reduced diversity. In contrast, the constant schedule (b) preserves greater variability in both car shapes and positions, which is also reflected by a larger mean squared error (MSE).
  • Figure 4: Generated samples with the prompt "A view of a bathroom that is clean" at high sampling budget (NFE=50). While constant schedules yield semantically consistent but overly uniform results, methods adhering to the low-high-low scheduling principle exhibit significantly higher diversity.
  • Figure 5: Vanilla-CFG, interval-CFG, $\beta$-CFG and TV-CFG guidance scale settings at $\omega=9$.
  • ...and 6 more figures

Theorems & Definitions (15)

  • Theorem 3.2: In-expectation early-stage proximity under CFG
  • Theorem 3.3: Persistence of the weaker mode under CFG
  • Proposition 3.4: Initialization bias from the first stage
  • Theorem 3.5: CFG yields stronger within-mode contraction
  • Lemma C.1
  • Theorem C.2: Restatement of Theorem 3.2 (early-stage proximity)
  • proof
  • Remark C.3
  • Theorem C.4: Restatement of Theorem 3.3 (persistence of the weaker mode)
  • proof
  • ...and 5 more