CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models

Weichen Fan, Amber Yijia Zheng, Raymond A. Yeh, Ziwei Liu

TL;DR

CFG-Zero* addresses a limitation of classifier-free guidance (CFG) in flow matching models: when the estimated velocity is inaccurate, guidance pushes samples onto incorrect trajectories. It introduces two low-overhead improvements: an optimized scalar that rescales the unconditional velocity, and a zero-init strategy that zeroes out the first few ODE solver steps. The approach is analyzed on Gaussian mixtures and validated on ImageNet-256 and large-scale text-to-image and text-to-video benchmarks, showing consistent gains in perceptual quality and text alignment at minimal computational cost. The results suggest CFG-Zero* is a practical enhancement for controllable, flow-based image and video generation.

Abstract

Classifier-Free Guidance (CFG) is a widely adopted technique in diffusion/flow models to improve image fidelity and controllability. In this work, we first analytically study the effect of CFG on flow matching models trained on Gaussian mixtures where the ground-truth flow can be derived. We observe that in the early stages of training, when the flow estimation is inaccurate, CFG directs samples toward incorrect trajectories. Building on this observation, we propose CFG-Zero*, an improved CFG with two contributions: (a) optimized scale, where a scalar is optimized to correct for the inaccuracies in the estimated velocity, hence the * in the name; and (b) zero-init, which involves zeroing out the first few steps of the ODE solver. Experiments on both text-to-image (Lumina-Next, Stable Diffusion 3, and Flux) and text-to-video (Wan-2.1) generation demonstrate that CFG-Zero* consistently outperforms CFG, highlighting its effectiveness in guiding Flow Matching models. (Code is available at github.com/WeichenFan/CFG-Zero-star)
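The two contributions named in the abstract can be illustrated with a minimal NumPy sketch. It rests on assumptions not stated in this summary: the optimized scalar is taken to be the least-squares coefficient that best aligns the unconditional velocity with the conditional one, and the guided velocity is taken to be the usual CFG combination with the rescaled unconditional term. The function names, the `zero_init_steps` parameter, and the Euler solver are hypothetical and chosen only for illustration; the authors' repository should be consulted for the actual implementation.

```python
import numpy as np

def optimized_scale(v_cond, v_uncond, eps=1e-8):
    """Per-sample scalar s minimizing ||v_cond - s * v_uncond||^2 (assumed form)."""
    b = v_cond.shape[0]
    c = v_cond.reshape(b, -1)
    u = v_uncond.reshape(b, -1)
    dot = np.sum(c * u, axis=1)
    norm_sq = np.sum(u * u, axis=1) + eps  # eps guards against a zero velocity
    s = dot / norm_sq
    # Reshape so s broadcasts against the original velocity tensors.
    return s.reshape((b,) + (1,) * (v_cond.ndim - 1))

def cfg_zero_star_velocity(v_cond, v_uncond, w, step, zero_init_steps=1):
    """Guided velocity with zero-init and a rescaled unconditional branch."""
    if step < zero_init_steps:
        # Zero-init: output a zero velocity for the first few solver steps.
        return np.zeros_like(v_cond)
    s = optimized_scale(v_cond, v_uncond)
    # Standard CFG form, with v_uncond replaced by s * v_uncond.
    return s * v_uncond + w * (v_cond - s * v_uncond)

def euler_sample(velocity_fn, x0, num_steps, w, zero_init_steps=1):
    """Plain Euler integration from t=0 to t=1 (hypothetical solver choice).

    velocity_fn(x, t) must return the pair (v_cond, v_uncond).
    """
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v_cond, v_uncond = velocity_fn(x, i * dt)
        x = x + dt * cfg_zero_star_velocity(v_cond, v_uncond, w, i, zero_init_steps)
    return x
```

With `zero_init_steps=0` and the scalar fixed to 1, this sketch reduces to ordinary CFG, which makes it easy to compare the two side by side on a toy velocity field.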

Paper Structure

This paper contains 19 sections, 11 equations, 16 figures, 8 tables, 1 algorithm.

Figures (16)

  • Figure 1: Comparison for the prompt: "A dense winter forest with snow-covered branches, the golden light of dawn filtering through the trees, and a lone fox leaving delicate paw prints in the fresh snow." Images generated using SD3.5 with CFG and $\text{CFG-Zero$^\star$}$ (Ours).
  • Figure 2: (Left) Conditional generation. (Right) CFG generation. (Prompt: "A mysterious underwater city with bioluminescent corals and towering glass domes.")
  • Figure 3: Results on a mixture of Gaussians in ${\mathbb{R}}^2$. Left: The Jensen–Shannon divergence between the model's final flow sample distribution and the target distribution vs. training epoch. Right: The velocity error norm $\|\tilde{{\bm{v}}}^\theta_0 - \tilde{{\bm{v}}}^\ast_0\|$ vs. training epoch, with the ground-truth norm shown in gray.
  • Figure 4: Qualitative comparisons between CFG and $\text{CFG-Zero$^\star$}$. Experiments are conducted using Lumina-Next, Stable Diffusion 3, and Stable Diffusion 3.5, with each model evaluated under its recommended sampling steps and guidance scale. CFG results are shown in orange, and ours are highlighted in green boxes.
  • Figure 5: User study on Lumina-Next, Stable Diffusion 3, Stable Diffusion 3.5, and Flux. The win rate of our method compared to CFG is presented.
  • ...and 11 more figures