Towards Better Optimization For Listwise Preference in Diffusion Models

Jiamu Bai, Xin Yu, Meilong Xu, Weitao Lu, Xin Pan, Kiwan Maeng, Daniel Kifer, Jian Wang, Yu Wang

TL;DR

This work moves the alignment of diffusion models with human feedback from pairwise to listwise preferences. By modeling rankings with the Plackett-Luce model, Diffusion-LPO optimizes full preference lists directly, aligning the model with human judgments more closely than pairwise DPO does. The method constructs listwise groups from real user data, derives a Plackett-Luce-based objective, and demonstrates significant improvements in text-to-image quality, editing fidelity, and personalization on SD1.5 and SDXL. The results indicate that listwise supervision yields stronger, more consistent alignment signals and generalizes well to personalized settings, without requiring extra reward evaluators.

Abstract

Reinforcement learning from human feedback (RLHF) has proven effective for aligning text-to-image (T2I) diffusion models with human preferences. Although Direct Preference Optimization (DPO) is widely adopted for its computational efficiency and avoidance of explicit reward modeling, its application to diffusion models has relied primarily on pairwise preferences. The precise optimization of listwise preferences remains largely unaddressed. In practice, human feedback on image preferences often contains implicit ranking information, which conveys human preferences more precisely than pairwise comparisons do. In this work, we propose Diffusion-LPO, a simple and effective framework for Listwise Preference Optimization in diffusion models. Given a caption, we aggregate user feedback into a ranked list of images and derive a listwise extension of the DPO objective under the Plackett-Luce model. Diffusion-LPO enforces consistency across the entire ranking by encouraging each sample to be preferred over all of its lower-ranked alternatives. We empirically demonstrate the effectiveness of Diffusion-LPO across various tasks, including text-to-image generation, image editing, and personalized preference alignment. Diffusion-LPO consistently outperforms pairwise DPO baselines on visual quality and preference alignment.
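
To make the Plackett-Luce idea concrete, the following is a minimal sketch of the negative log-likelihood of a ranked list under the standard Plackett-Luce model. It is not the paper's objective, which is not reproduced in this excerpt. The function name `plackett_luce_nll` and the scalar `scores` are hypothetical placeholders; in a DPO-style setting the scores would be implicit rewards derived from the model, which this sketch does not compute.

```python
import math


def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    `scores[i]` is the (hypothetical) scalar score of the item at rank i,
    with index 0 being the most preferred item. Each item competes only
    against itself and the items ranked below it.
    """
    nll = 0.0
    # The last item has no lower-ranked alternatives, so its term is zero.
    for i in range(len(scores) - 1):
        tail = scores[i:]
        # Numerically stable log-sum-exp over the item and everything below it.
        m = max(tail)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in tail))
        nll += log_denominator - scores[i]
    return nll


# A list whose scores agree with the ranking gets a lower loss than the
# same scores in reversed order.
agreeing = plackett_luce_nll([2.0, 1.0, 0.0])
reversed_order = plackett_luce_nll([0.0, 1.0, 2.0])
```

With two items the expression reduces to the familiar pairwise logistic loss, -log sigmoid(s_0 - s_1), which is one way to see why a listwise objective generalizes the pairwise one.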

Paper Structure

This paper contains 52 sections, 37 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Sample images generated from SDXL trained with Diffusion-LPO. Diffusion-LPO generalizes Diffusion-DPO by optimizing preferences over a ranked list of images. After finetuning with Diffusion-LPO, SDXL produces images with better visual aesthetics and prompt alignment.
  • Figure 2: An example of a ranked list of images under human preference. The caption is "Rocket Raccoon, furry art, fanart, digital painting."
  • Figure 3: Images generated from original SDXL, Diffusion-DPO, DSPO, and Diffusion-LPO. Diffusion-LPO demonstrates improved image generation quality over the baselines in general aesthetics and detail handling. The last column shows zoomed-in regions.
  • Figure 4: Images generated by Diffusion-LPO with personal preference alignment, compared with other baselines. User profiles are summarized by a VLM. "SD 1.5+Profile" denotes images generated using SD 1.5 with the user profile appended to the caption. User preferences are highlighted in green and dispreferences in red.
  • Figure 5: An example of a group whose preferences form a DAG. An arrow pointing from image ${\mathbf{x}}_{A}$ to image ${\mathbf{x}}_{B}$ represents the human preference ${\mathbf{x}}_{A} \succ {\mathbf{x}}_{B}$. The prompt is "Rocket Raccoon, furry art, fanart, digital painting". Here, the valid ranked lists are $({\mathbf{x}}_{A} \succ {\mathbf{x}}_{B} \succ {\mathbf{x}}_{C} \succ {\mathbf{x}}_{E})$ and $({\mathbf{x}}_{A} \succ {\mathbf{x}}_{D} \succ {\mathbf{x}}_{E})$.
  • ...and 4 more figures
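
The Figure 5 caption describes reading ranked lists off a preference DAG. One way this could be sketched is to enumerate every source-to-sink path. The function `maximal_ranked_lists` is a hypothetical helper, and the edge set below is an assumption inferred from the two lists in the caption, since the figure itself is not reproduced here. The authors' actual list-construction procedure may differ.

```python
def maximal_ranked_lists(edges):
    """Enumerate every source-to-sink path in a preference DAG.

    `edges` is a list of (winner, loser) pairs, meaning winner is preferred
    over loser. Each returned path is one ranked list, best item first.
    Assumes the graph is acyclic.
    """
    children = {}
    nodes = set()
    has_parent = set()
    for winner, loser in edges:
        children.setdefault(winner, []).append(loser)
        nodes.update((winner, loser))
        has_parent.add(loser)

    paths = []

    def walk(node, prefix):
        prefix = prefix + [node]
        if node not in children:  # sink: nothing ranked below this image
            paths.append(prefix)
            return
        for child in sorted(children[node]):
            walk(child, prefix)

    # Start from images that no other image is preferred over.
    for source in sorted(nodes - has_parent):
        walk(source, [])
    return paths


# Assumed edges, chosen to be consistent with the caption of Figure 5.
edges = [("A", "B"), ("B", "C"), ("C", "E"), ("A", "D"), ("D", "E")]
ranked = maximal_ranked_lists(edges)
print(ranked)  # [['A', 'B', 'C', 'E'], ['A', 'D', 'E']]
```

Under the assumed edges, the two paths match the two ranked lists given in the caption.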