Aligning Diffusion Models by Optimizing Human Utility

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka

TL;DR

Diffusion-KTO introduces a reward-model-free framework for aligning text-to-image diffusion models by optimizing a Kahneman-Tversky–style utility over per-step, per-image actions using binary feedback (like/dislike). By extending the utility maximization paradigm to diffusion processes, it avoids collecting pairwise preferences and backpropagating through the full denoising trajectory. Empirically, Diffusion-KTO outperforms supervised fine-tuning, Diffusion-DPO, and other baselines across human judgments and automated metrics (PickScore, ImageReward, LAION aesthetics, CLIP) on multiple datasets, demonstrating robust improvements in image fidelity and prompt adherence. The work also shows the method’s flexibility, including synthetic alignment to specific user preferences and generalization to different SD variants, while acknowledging dataset biases and safety limitations as important considerations for real-world deployment.

Abstract

We present Diffusion-KTO, a novel approach for aligning text-to-image diffusion models by formulating the alignment objective as the maximization of expected human utility. Since this objective applies to each generation independently, Diffusion-KTO does not require collecting costly pairwise preference data nor training a complex reward model. Instead, our objective requires simple per-image binary feedback signals, e.g. likes or dislikes, which are abundantly available. After fine-tuning using Diffusion-KTO, text-to-image diffusion models exhibit superior performance compared to existing techniques, including supervised fine-tuning and Diffusion-DPO, both in terms of human judgment and automatic evaluation metrics such as PickScore and ImageReward. Overall, Diffusion-KTO unlocks the potential of leveraging readily available per-image binary signals and broadens the applicability of aligning text-to-image diffusion models with human preferences.
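
To make the idea of a per-image utility objective concrete, here is a minimal sketch in Python. It is not the paper's exact loss. It assumes a sigmoid utility, uses the gap in squared noise-prediction error between a frozen reference model and the fine-tuned model as a stand-in for the per-step implicit reward, and takes a scalar `baseline` as a reference point. The function name `utility_loss` and all parameter names are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def utility_loss(policy_err, ref_err, liked, beta=1.0, baseline=0.0):
    """Negative mean utility over a batch of single denoising steps.

    policy_err[i], ref_err[i]: squared noise-prediction errors of the
        fine-tuned and frozen reference models at one sampled timestep
        (assumed proxy for the per-step log-likelihood ratio).
    liked[i]: True for a "like", False for a "dislike".
    beta, baseline: illustrative hyperparameters, not values from the paper.
    """
    total = 0.0
    for p, r, y in zip(policy_err, ref_err, liked):
        # Positive when the fine-tuned model denoises this image better
        # than the reference model does.
        implicit_reward = beta * (r - p)
        # Liked images are rewarded for a high implicit reward,
        # disliked images for a low one.
        sign = 1.0 if y else -1.0
        total += sigmoid(sign * (implicit_reward - baseline))
    return -total / len(liked)


if __name__ == "__main__":
    # Each sample carries its own binary label; no pairing is needed.
    print(utility_loss([0.5, 1.2], [1.0, 1.0], [True, False]))
```

Because each term depends on a single image and a single timestep, the sketch needs neither paired preferences nor a pass through the full sampling chain, which mirrors the property the abstract highlights.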

Paper Structure (37 sections, 8 equations, 13 figures, 7 tables)

Figures (13)

  • Figure 1: Diffusion-KTO is a novel framework for aligning text-to-image diffusion models with human preferences using only per-sample binary feedback. Diffusion-KTO bypasses the need to collect expensive pairwise preference data and avoids training a reward model. As seen above, Diffusion-KTO-aligned text-to-image models generate images that better align with human preferences. We display results after fine-tuning Stable Diffusion v1-5 and sampling prompts from the HPS v2 [wu2023humanv2], Pick-a-Pic [kirstain2024pick], and PartiPrompts [yu2022scaling] datasets.
  • Figure 2: Diffusion-KTO aligns text-to-image diffusion models using per-image binary feedback. Existing alignment approaches (Left) are restricted to learning from pairwise preferences. However, Diffusion-KTO (Right) uses per-image preferences which are abundantly available on the Internet. As seen above, the quality of an image can be assessed independent of another generation for the same prompt. More importantly, such per-image preferences provide valuable signals for aligning T2I models, as demonstrated by our results.
  • Figure 3: We present Diffusion-KTO, which aligns text-to-image diffusion models by extending the utility maximization framework to the setting of diffusion models. Since this framework aims to maximize the utility of each generation ($U(x)$) independently, it does not require paired preference data. Instead, Diffusion-KTO trains with per-image binary feedback signals, e.g. likes and dislikes. Our objective also extends to each step in the diffusion process, thereby avoiding the need to back-propagate a reward through the entire sampling process.
  • Figure 4: User study win-rate (%) comparing Diffusion-KTO (SD v1-5) to SD v1-5, SFT (SD v1-5), and Diffusion-DPO (SD v1-5). Results of our user study show that Diffusion-KTO significantly improves the alignment of the base SD v1-5 model. Moreover, our Diffusion-KTO-aligned model also outperforms supervised fine-tuning (SFT) and the officially released Diffusion-DPO model, as judged by users, despite training only with simple per-image binary feedback. We also include the 95% confidence interval of the win-rate.
  • Figure 5: Side-by-side comparison of images generated by related methods using SD v1-5. Diffusion-KTO demonstrates a significant improvement in terms of aesthetic appeal and fidelity to the caption (see the qualitative results section).
  • ...and 8 more figures