CFG-EC: Error Correction Classifier-Free Guidance

Nakkyu Yang, Yechan Lee, SooJean Han

TL;DR

CFG-EC tackles the training-sampling mismatch in classifier-free guidance by proactively correcting the unconditional noise prediction. It uses Gram-Schmidt orthogonalization to make the unconditional error orthogonal to the conditional error, thereby eliminating the inner-product term that degrades sampling quality and tightening the error bound. Empirical results on SDXL/SD1.5 with MSCOCO show improved FID and CLIP scores, especially at low guidance scales, with a dynamic variant offering the best balance between fidelity and prompt alignment. The approach is versatile and can augment CFG-based methods, providing a robust path toward higher-fidelity, more text-aligned image generation.
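
The Gram-Schmidt step mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `orthogonalize` and `cosine_similarity`, the tensor shape, and the synthetic error vectors are all assumptions made for illustration, and how CFG-EC actually estimates the two error terms during sampling is not described in this summary. The sketch only shows what it means to remove from one vector its component along another.

```python
import numpy as np

def orthogonalize(err_uncond, err_cond, eps=1e-12):
    """One Gram-Schmidt step: subtract from err_uncond its projection onto err_cond."""
    u = err_uncond.ravel()
    c = err_cond.ravel()
    # Projection coefficient <u, c> / <c, c>; eps guards against a zero vector.
    coeff = np.dot(u, c) / (np.dot(c, c) + eps)
    return err_uncond - coeff * err_cond

def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two tensors, treated as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Synthetic stand-ins for the two error terms (hypothetical latent shape 4x8x8).
rng = np.random.default_rng(0)
err_cond = rng.standard_normal((4, 8, 8))
# Deliberately correlated with err_cond so the projection has something to remove.
err_uncond = 0.5 * err_cond + rng.standard_normal((4, 8, 8))

err_uncond_perp = orthogonalize(err_uncond, err_cond)
cos_before = cosine_similarity(err_uncond, err_cond)
cos_after = cosine_similarity(err_uncond_perp, err_cond)
print(f"cosine similarity before: {cos_before:.3f}, after: {cos_after:.2e}")
```

After the projection, the cosine similarity between the two vectors is zero up to floating-point error, which is the property the full correction is described as enforcing.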

Abstract

Classifier-Free Guidance (CFG) has become a mainstream approach for simultaneously improving prompt fidelity and generation quality in conditional generative models. During training, CFG stochastically alternates between conditional and null prompts to enable both conditional and unconditional generation. However, during sampling, CFG outputs both null and conditional prompts simultaneously, leading to inconsistent noise estimates between the training and sampling processes. To reduce this error, we propose CFG-EC, a versatile correction scheme augmentable to any CFG-based method by refining the unconditional noise predictions. CFG-EC actively realigns the unconditional noise error component to be orthogonal to the conditional error component. This corrective maneuver prevents interference between the two guidance components, thereby constraining the sampling error's upper bound and establishing more reliable guidance trajectories for high-fidelity image generation. Our numerical experiments show that CFG-EC handles the unconditional component more effectively than CFG and CFG++, delivering a marked performance increase in the low guidance sampling regime and consistently higher prompt alignment across the board.
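
To see why orthogonality constrains the error, consider a short worked expansion. It assumes the standard CFG combination $\hat{\epsilon} = \epsilon_{\varnothing} + \omega(\epsilon_{c} - \epsilon_{\varnothing})$ and uses $e_{\varnothing}$ and $e_{c}$ as illustrative symbols for the unconditional and conditional prediction errors; the paper's own notation and its exact bound may differ. Under that assumption the combined error is $(1-\omega)e_{\varnothing} + \omega e_{c}$, and its squared norm expands as

$$
\left\| (1-\omega)\, e_{\varnothing} + \omega\, e_{c} \right\|^{2}
= (1-\omega)^{2} \left\| e_{\varnothing} \right\|^{2}
+ \omega^{2} \left\| e_{c} \right\|^{2}
+ 2\,\omega(1-\omega)\, \langle e_{\varnothing}, e_{c} \rangle .
$$

When $\langle e_{\varnothing}, e_{c} \rangle = 0$, the cross term vanishes and the combined error depends only on the two individual error norms, so the two components no longer interfere with each other.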

Paper Structure

This paper contains 15 sections, 15 equations, 5 figures, 4 tables, 2 algorithms.

Figures (5)

  • Figure 1: Comparison of Text-to-Image (T2I) generation results from Stable Diffusion XL (SDXL) with DDIM at 50 NFEs, demonstrating our method's improved image fidelity. The CFG++ baseline (a) yields structural and textual artifacts (e.g., a third tusk, garbled text), whereas our approach (b) achieves significantly higher visual fidelity and enhanced detail.
  • Figure 2: Cosine similarity of the error vectors before and after correction by each method. The Full Method (a) sets the cosine similarity to 0, whereas the Dynamic Method (b) adaptively adjusts it to dynamically control the strength of the correction.
  • Figure 3: Comparison of SDXL T2I generation with DDIM at 50 NFEs. The baseline (a) exhibits a significant structural artifact, featuring a physically incorrect railway track. In contrast, the result in (b) presents a coherent structure.
  • Figure 4: The corresponding SDXL ($\omega$ = 0.6) T2I images generated using CFG++ (left) and CFG-EC++ (right) demonstrate improvements. The generated images exhibit fewer artifacts and better prompt alignment.
  • Figure 5: Evolution of denoised estimates generated by SDXL ($\omega$ = 5.0). CFG-EC (bottom) shows better adherence to text alignment during the denoising process compared to CFG (top).