Elucidating Optimal Reward-Diversity Tradeoffs in Text-to-Image Diffusion Models

Rohit Jena; Ali Taghibakhshi; Sahil Jain; Gerald Shen; Nima Tajbakhsh; Arash Vahdat

TL;DR

This work tackles reward hacking in text-to-image diffusion models finetuned on human-preference rewards. It proves that reward hacking is inevitable under expected reward maximization and analyzes the limitations of standard regularizers such as KL divergence and LoRA scaling. It introduces Annealed Importance Guidance (AIG), an inference-time regularization inspired by Annealed Importance Sampling (AIS) that gradually shifts influence from the base model to the reward-finetuned model over the course of sampling, preserving diversity while improving reward alignment. Experiments on Stable Diffusion v1.4 and SDXL with PickScore and HPSv2 rewards show that AIG achieves Pareto-optimal reward-diversity tradeoffs and maintains image-text alignment, and a user study across architectures and reward functions supports these findings. The work also proposes a Spectral Distance-based coverage metric to account for reference mismatch and reports that AIG outperforms KL and LoRA regularization in most settings, offering a practical, inexpensive inference-time method for diffusion models finetuned with human-preference objectives.

Abstract

Text-to-image (T2I) diffusion models have become prominent tools for generating high-fidelity images from text prompts. However, when trained on unfiltered internet data, these models can produce unsafe, incorrect, or stylistically undesirable images that are not aligned with human preferences. To address this, recent approaches have incorporated human preference datasets to fine-tune T2I models or to optimize reward functions that capture these preferences. Although effective, these methods are vulnerable to reward hacking, where the model overfits to the reward function, leading to a loss of diversity in the generated images. In this paper, we prove the inevitability of reward hacking and study natural regularization techniques like KL divergence and LoRA scaling, and their limitations for diffusion models. We also introduce Annealed Importance Guidance (AIG), an inference-time regularization inspired by Annealed Importance Sampling, which retains the diversity of the base model while achieving Pareto-Optimal reward-diversity tradeoffs. Our experiments demonstrate the benefits of AIG for Stable Diffusion models, striking the optimal balance between reward optimization and image diversity. Furthermore, a user study confirms that AIG improves diversity and quality of generated images across different model architectures and reward functions.
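The idea of shifting influence from the base model to the reward-finetuned model during sampling can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the power-law schedule, and the choice to mix the two models' noise predictions by a convex combination are hypothetical, and the paper's exact schedule and mixing rule are not given in this summary.

```python
def annealing_weight(step, num_steps, power=1.0):
    """Hypothetical schedule: 0.0 at the first sampling step, 1.0 at the last.

    `power` is an assumed knob: values above 1 keep the base model in
    control for longer, values below 1 hand over to the finetuned model sooner.
    """
    if num_steps < 2:
        return 1.0
    return (step / (num_steps - 1)) ** power


def annealed_prediction(eps_base, eps_finetuned, weight):
    """Convex combination of two noise predictions (flat lists of floats).

    weight = 0.0 returns the base prediction, weight = 1.0 the finetuned one.
    """
    return [(1.0 - weight) * b + weight * f
            for b, f in zip(eps_base, eps_finetuned)]


def mixing_trace(num_steps, power=1.0):
    """Weights used at each sampling step, from first to last."""
    return [annealing_weight(i, num_steps, power) for i in range(num_steps)]


if __name__ == "__main__":
    # Stand-ins for the outputs of the two denoisers at one sampling step.
    eps_b = [1.0, 2.0]
    eps_f = [3.0, 4.0]
    for step, w in enumerate(mixing_trace(5)):
        print(step, w, annealed_prediction(eps_b, eps_f, w))
```

In this sketch the early, high-noise steps follow the base model, which is where most of the sample's coarse structure is decided, and the late steps follow the finetuned model. Exposing `power` mirrors the claim that the level of annealing is a user-controllable inference-time setting, although the real control parameter may differ.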

Paper Structure (27 sections, 2 theorems, 25 equations, 39 figures)

Introduction
Related Work
Method
Problem Setup
Existing Regularizations
KL divergence
LoRA scaling
Annealed Importance Guidance
Experiments
Evaluation Measures
Reward-KL tradeoff
Reward-diversity tradeoff
User Preference Study
Conclusion
Appendix
...and 12 more sections

Key Result

Lemma 1

Inevitability of Reward Hacking. In a non-parametric setting under the expected reward maximization objective, the optimal probability distribution $p(\mathbf{x}_0|\mathbf{c})$ collapses to $p(\mathbf{x}_0|\mathbf{c}) = \delta(\mathbf{x}_0 - \mathbf{x}_0^{*rc}), \mathbf{x}_0^{*rc} = \arg\max_{\ma
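The intuition behind a collapse of this kind can be sketched with a standard bound. This is not the paper's proof. The reward notation $r(\mathbf{x}_0, \mathbf{c})$ is an assumption, since the lemma statement is truncated in this summary, and the sketch further assumes that the reward attains its maximum.

```latex
\begin{align*}
\mathbb{E}_{\mathbf{x}_0 \sim p(\cdot|\mathbf{c})}\left[ r(\mathbf{x}_0, \mathbf{c}) \right]
  &= \int p(\mathbf{x}_0|\mathbf{c})\, r(\mathbf{x}_0, \mathbf{c})\, d\mathbf{x}_0 \\
  &\le \max_{\mathbf{x}} r(\mathbf{x}, \mathbf{c}) \int p(\mathbf{x}_0|\mathbf{c})\, d\mathbf{x}_0
   = \max_{\mathbf{x}} r(\mathbf{x}, \mathbf{c}).
\end{align*}
```

The bound is attained when $p(\mathbf{x}_0|\mathbf{c})$ places all of its mass on a maximizer of the reward, so an unconstrained maximizer of expected reward has no incentive to keep any diversity.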

Figures (39)

Figure 1: Illustration of Annealed Importance Guidance. Blue trajectories generate samples from the base distribution $p_b(\mathbf{x})$. After finetuning with reward models, the trajectories exhibit reward hacking and mode collapse, as shown by the green trajectories collapsing to a single mode of the desired data distribution $p^*(\mathbf{x})$. Our method anneals between the dynamics of the base and DRaFT score functions, showing base-model-like exploration followed by DRaFT-like steady convergence to the modes of $p^*(\mathbf{x})$. The level of annealing is controllable by the end user at inference (see the Annealed Importance Guidance section).
Figure 2: DRaFT
Figure 3: SDXL Base
Figure 4: Annealed Importance Guidance (Ours)
Figure 6: Tradeoff between reward and KL divergence. For SD1.4, the model breaks down at higher KL values, indicating that the same $\lambda$ hyperparameter cannot be used universally across architectures or reward models. The SD1.4 model suffers from breakdown even with LoRA finetuning.
...and 34 more figures

Theorems & Definitions (4)

Lemma 1
proof
Lemma 2
proof
