Generalizing Alignment Paradigm of Text-to-Image Generation with Preferences through $f$-divergence Minimization

Haoyuan Sun; Bo Xia; Yongzhe Chang; Xueqian Wang

TL;DR

A comprehensive evaluation of text-to-image alignment performance, human value alignment performance, and generation diversity under different divergence constraints indicates that text-to-image alignment based on Jensen-Shannon divergence achieves the best trade-off among them.

Abstract

Direct Preference Optimization (DPO) has recently been extended from aligning large language models (LLMs) to aligning text-to-image models with human preferences, which has generated considerable interest within the community. However, we observe that these approaches rely solely on minimizing the reverse Kullback-Leibler divergence between the fine-tuned model and the reference model during the alignment process, neglecting other divergence constraints. In this study, we extend the reverse Kullback-Leibler divergence in the alignment paradigm of text-to-image models to $f$-divergence, aiming to achieve better alignment performance as well as good generation diversity. We provide the generalized formula of the alignment paradigm under the $f$-divergence condition and thoroughly analyze the impact of different divergence constraints on the alignment process from the perspective of gradient fields. We conduct a comprehensive evaluation of image-text alignment performance, human value alignment performance, and generation diversity performance under different divergence constraints, and the results indicate that alignment based on Jensen-Shannon divergence achieves the best trade-off among them. The choice of divergence employed for aligning text-to-image models significantly impacts the trade-off between alignment performance (especially human value alignment) and generation diversity, which highlights the necessity of selecting an appropriate divergence for practical applications.
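To make the divergence family named in the abstract concrete, the following is a minimal sketch that evaluates three $f$-divergences on small discrete distributions. It assumes the common convention $D_f(p \,\|\, q) = \sum_x q(x)\, f\big(p(x)/q(x)\big)$ with $p$ playing the role of the fine-tuned model and $q$ the reference model; the function names and the example distributions are hypothetical and not taken from the paper. Under this convention the generator used for Jensen-Shannon yields twice the usual Jensen-Shannon divergence.

```python
import math

# Generator functions f(u), with u = p(x) / q(x). Names are illustrative only.
def f_reverse_kl(u):
    # f(u) = u log u  gives  KL(p || q), the "reverse KL" when p is the tuned model
    return u * math.log(u)

def f_forward_kl(u):
    # f(u) = -log u  gives  KL(q || p)
    return -math.log(u)

def f_jensen_shannon(u):
    # f(u) = u log u - (u + 1) log((u + 1) / 2)  gives  2 * JSD(p, q)
    return u * math.log(u) - (u + 1.0) * math.log((u + 1.0) / 2.0)

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)) for strictly positive p and q."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

# Hypothetical toy distributions over three outcomes.
p_theta = [0.7, 0.2, 0.1]
p_ref = [0.4, 0.4, 0.2]

for name, f in [("reverse KL", f_reverse_kl),
                ("forward KL", f_forward_kl),
                ("Jensen-Shannon (x2)", f_jensen_shannon)]:
    print(name, f_divergence(p_theta, p_ref, f))
```

Each generator satisfies $f(1) = 0$, so the divergence vanishes when the two distributions coincide; the sketch requires strictly positive probabilities because of the logarithms and the ratio.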

Paper Structure (35 sections, 9 theorems, 83 equations, 6 figures, 7 tables)

This paper contains 35 sections, 9 theorems, 83 equations, 6 figures, 7 tables.

Introduction
Related Work
Aligning Text-to-Image Model with Preferences
$f$-divergence utilized in Generation Models
Preliminary
$f$-divergence
Method
Generalized Formula
Analysis on Gradient Fields of Alignment Process
Experiments
Experimental Settings
Benchmark.
Evaluation Metrics.
Image-Text Alignment (For Q1)
Human Value Alignment (For Q2)
...and 20 more sections

Key Result

Theorem 1

If $p_{\mathrm{ref}}(x_{0:T}|c) > 0$ holds for all conditions $c$, $f'(x)$ is an invertible function, and $0$ is not in the domain of $f'(x)$, then the reward class consistent with the Bradley-Terry model can be reparameterized with the sampling probability $p_{\theta}(x_{0:T}|c)$ and the
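The theorem statement above is cut off in this excerpt, so the exact reparameterization is not shown here. As a hedged illustration only, the sketch below assumes the form familiar from $f$-divergence generalizations of DPO for language models, where the reward is $\beta f'(p_\theta / p_{\mathrm{ref}})$ up to a term that cancels in a Bradley-Terry comparison. The function names, the `beta` default, and the use of precomputed log density ratios are all assumptions; for diffusion models the ratio over $x_{0:T}$ is not directly available and would need an approximation that this sketch does not cover.

```python
import math

# Derivatives f'(u) of the generators, with u = p_theta / p_ref (illustrative names).
def fprime_reverse_kl(u):
    return math.log(u) + 1.0               # f(u) = u log u

def fprime_forward_kl(u):
    return -1.0 / u                        # f(u) = -log u

def fprime_jensen_shannon(u):
    return math.log(2.0 * u / (1.0 + u))   # f(u) = u log u - (u+1) log((u+1)/2)

def _softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def pairwise_f_loss(log_ratio_w, log_ratio_l, fprime, beta=0.1):
    """Bradley-Terry loss -log sigmoid(r_w - r_l) with ASSUMED reward r = beta * f'(u).

    log_ratio_w / log_ratio_l: log(p_theta / p_ref) for the preferred and the
    rejected sample (hypothetical inputs, assumed to be given).
    """
    u_w = math.exp(log_ratio_w)
    u_l = math.exp(log_ratio_l)
    margin = beta * (fprime(u_w) - fprime(u_l))
    return _softplus(-margin)              # equals -log sigmoid(margin)

# With the reverse-KL generator the constant in f' cancels and the margin
# reduces to beta * (log_ratio_w - log_ratio_l), i.e. the standard DPO form.
print(pairwise_f_loss(0.5, -0.3, fprime_reverse_kl))
print(pairwise_f_loss(0.5, -0.3, fprime_jensen_shannon))
```

Swapping `fprime` changes how strongly the loss rewards pushing the density ratio of the preferred sample away from that of the rejected one, which is the kind of difference a gradient-field analysis would compare across divergences.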

Figures (6)

Figure 1: Examples of images generated by the model aligned using the Jensen-Shannon divergence constraint.
Figure 2: Landscape visualizations of the alignment objective functions under different divergences, shown from two viewing angles, and gradient-field visualizations of the corresponding loss functions on the plane $Z=50$.
...and 1 more figures

Theorems & Definitions (15)

Theorem 1
Theorem 2
Theorem 3
Theorem 4
Theorem 1
Theorem 2
Theorem 3
Remark 3.1
Theorem 4
Theorem 5
...and 5 more
