Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

Dian Xie; Shitong Shao; Lichen Bai; Zikai Zhou; Bojun Cheng; Shuo Yang; Jun Wu; Zeke Xie

TL;DR

This paper reveals a critical evaluation pitfall: common human preference models exhibit a strong bias towards large guidance scales. It also introduces a novel guidance-aware evaluation framework that uses effective guidance scale calibration to enable fair comparison between current guidance methods and CFG.

Abstract

Classifier-free guidance (CFG) has helped diffusion models achieve strong conditional generation in various fields. Recently, more diffusion guidance methods have emerged with improved generation quality and human preference. However, do these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work consists of four main contributions. First, we reveal a critical evaluation pitfall: common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores because of stronger semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale calibration to enable fair comparison between current guidance methods and CFG by identifying the effects orthogonal and parallel to CFG effects. Third, motivated by the evaluation pitfall, we design the Transcendent Diffusion Guidance (TDG) method, which significantly improves human preference scores in the conventional evaluation framework but does not actually work in practice. Fourth, in extensive experiments, we empirically evaluate eight recent diffusion guidance methods within both the conventional evaluation framework and the proposed GA-Eval framework. Notably, simply increasing the CFG scale can compete with most of the studied diffusion guidance methods, while all methods suffer severe degradation in winning rate over standard CFG. Our work should strongly motivate the community to rethink the evaluation paradigm and the future directions of this field.
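To make the two quantities in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the common CFG convention in which the guided noise prediction is $\epsilon_u + \omega(\epsilon_c - \epsilon_u)$, so that $\omega = 1$ recovers the conditional prediction, and it assumes a winning rate is the fraction of prompts on which a method's preference score is strictly higher than the baseline's. The names `cfg_combine` and `win_rate`, the tie handling, and the toy numbers are all hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def cfg_combine(eps_uncond: Sequence[float],
                eps_cond: Sequence[float],
                omega: float) -> List[float]:
    """Combine noise predictions with classifier-free guidance.

    Assumed convention: eps_u + omega * (eps_c - eps_u).
    omega = 0 gives the unconditional prediction, omega = 1 the
    conditional one, and omega > 1 extrapolates past it.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + omega * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def win_rate(method_scores: Sequence[float],
             baseline_scores: Sequence[float]) -> float:
    """Fraction of prompts where the method strictly beats the baseline.

    Assumption: ties count as losses for the method.
    """
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must have the same length")
    if not method_scores:
        raise ValueError("score lists must not be empty")
    wins = sum(m > b for m, b in zip(method_scores, baseline_scores))
    return wins / len(method_scores)


if __name__ == "__main__":
    # Toy one-dimensional "noise predictions" (made-up values).
    eps_u, eps_c = [0.0, 1.0], [1.0, 3.0]
    for omega in (1.0, 5.5, 20.0):
        print(omega, cfg_combine(eps_u, eps_c, omega))

    # Toy preference scores for three prompts (made-up values).
    print(win_rate([0.3, 0.2, 0.5], [0.1, 0.25, 0.4]))
```

Under these assumptions, comparing a method's winning rate against standard CFG with its winning rate against CFG at a calibrated scale would show how much of an apparent gain can be matched by simply raising $\omega$.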

Paper Structure (19 sections, 15 equations, 22 figures, 12 tables, 1 algorithm)

Introduction
Related Work
Diffusion Model
Diffusion Guidance and Sampling
Evaluation for Text-to-Image Generation
Guidance Matters to Evaluation
Transcendent Diffusion Guidance
Empirical Analysis and Discussion
Experimental Settings
Main Results
Discussion and Analysis
Conclusion
LLM Usage
Experimental Settings
Transcendent Diffusion Guidance
...and 4 more sections

Figures (22)

Figure 1: The HPS v2 scores of generated images under different CFG scales $\omega \in \{5.5, 10, 15, 20\}$. Model: Stable Diffusion-XL. HPS v2 exhibits a strong bias towards large CFG scales over a wide range, even when generation quality starts to degrade because the guidance is too strong.
Figure 2: Visual comparison of different methods with their corresponding results under effective guidance scale $\omega^\mathrm{e}$. The e-CFG method, namely standard CFG with effective CFG scales calibrated by GA-Eval, can easily achieve performance improvement comparable to most recent diffusion guidance or sampling methods.
Figure 3: Different evaluation metrics under different guidance scales. Model: Stable Diffusion-XL. Dataset: Pick-a-Pic. Except for AES and PickScore, the metrics give higher ratings to images generated with a larger guidance scale within $\omega\in[5.5, 20]$.
Figure 4: Comparison of classifier-free guidance (Ho and Salimans, 2021) and transcendent diffusion guidance (TDG).
Figure 5: The winning rate $\eta^\text{e-CFG}$ of different methods compared to effective CFG, and the corresponding degradation $\Delta\eta$, on the HPD dataset. Left: $\eta^\text{e-CFG}$. Right: $\Delta\eta$. Among all methods, applying $\omega^\mathrm{e}$ has only a minor influence on APG. Meanwhile, AES shows a negative $\Delta\eta$ for many methods, which means AES gives worse ratings to images generated with a large guidance scale.
...and 17 more figures
