SEE-DPO: Self Entropy Enhanced Direct Preference Optimization

Shivanshu Shekhar, Shreyas Singh, Tong Zhang

TL;DR

This work addresses reward hacking and limited diversity in RLHF-aligned diffusion models by introducing self-entropy regularization into the Direct Preference Optimization framework. The SEE-DPO approach flattens the reference distribution via a gamma-weighted entropy term, promoting exploration and robustness across the latent space while preserving compatibility with existing DPO variants. The authors provide a unified theoretical view linking D3PO, SPO, and Diffusion-DPO under the augmented objective, and validate improvements across image quality and diversity metrics, including user studies. The results demonstrate state-of-the-art performance and greater output diversity, with implications for more reliable and expressive diffusion-based generation in practical settings.

Abstract

Direct Preference Optimization (DPO) has been successfully used to align large language models (LLMs) according to human preferences, and more recently it has also been applied to improving the quality of text-to-image diffusion models. However, DPO-based methods such as SPO, Diffusion-DPO, and D3PO are highly susceptible to overfitting and reward hacking, especially when the generative model is optimized to fit out-of-distribution during prolonged training. To overcome these challenges and stabilize the training of diffusion models, we introduce a self-entropy regularization mechanism in reinforcement learning from human feedback. This enhancement improves DPO training by encouraging broader exploration and greater robustness. Our regularization technique effectively mitigates reward hacking, leading to improved stability and enhanced image quality across the latent space. Extensive experiments demonstrate that integrating human feedback with self-entropy regularization can significantly boost image diversity and specificity, achieving state-of-the-art results on key image generation metrics.
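As a rough illustration of how a self-entropy term can enter a DPO-style loss, the sketch below assumes the regularized objective adds an entropy bonus weighted by $\beta\gamma$ to the usual KL-constrained reward maximization. Under that assumption the optimal policy is proportional to $\pi_{\text{ref}}^{1/(1+\gamma)} \exp\big(r / (\beta(1+\gamma))\big)$, which flattens the reference distribution, and the implicit reward becomes $\beta\big((1+\gamma)\log\pi - \log\pi_{\text{ref}}\big)$ up to a constant. The function names, the scalar log-probability inputs, and the exact placement of $\gamma$ are illustrative assumptions, not the authors' implementation; for diffusion models the log-probabilities would come from per-step denoising terms, which are abstracted away here.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(logp: float, ref_logp: float, beta: float, gamma: float) -> float:
    """Implicit reward under the ASSUMED entropy-regularized objective.

    gamma = 0 recovers the standard DPO implicit reward
    beta * (log pi - log pi_ref).
    """
    return beta * ((1.0 + gamma) * logp - ref_logp)


def see_dpo_loss(
    logp_w: float,
    logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,
    gamma: float = 0.0,
) -> float:
    """Bradley-Terry preference loss for one (preferred, rejected) pair.

    logp_w / logp_l: policy log-probabilities of the preferred and rejected
    samples; ref_logp_w / ref_logp_l: the same under the frozen reference.
    All four are hypothetical scalar inputs used only for illustration.
    """
    margin = implicit_reward(logp_w, ref_logp_w, beta, gamma) - implicit_reward(
        logp_l, ref_logp_l, beta, gamma
    )
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Same pair scored with and without the entropy weight.
    print(see_dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=0.1, gamma=0.0))
    print(see_dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=0.1, gamma=3.0))
```

In this sketch, setting `gamma=0` gives the ordinary DPO loss, so the entropy term acts as a drop-in modification of the implicit reward rather than a separate training stage.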

Paper Structure

This paper contains 14 sections, 32 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Effect of entropy regularization on reward hacking. Entropy regularization mitigates reward hacking during Direct Preference Optimization (DPO) [dpovsppo]. The A images are distorted, whereas the B images are aesthetically better at the same epochs.
  • Figure 2: Accuracy vs. steps. A: Step-wise PickScore [spo]; B: Aesthetic Score [aes_score]; C: PickScore [pickapic]. The curves demonstrate faster convergence, improved training stability, and robustness to reward hacking.
  • Figure 3: User study. We compare our best models with counterparts fine-tuned using the PickScore reward model [pickapic]. Following SPO [spo], we randomly select 300 prompts from PartiPrompts [parti] and HPS [hpsv2] in a 1:2 ratio. Participants are shown a reference image alongside three images generated from the same prompt but with different initial latents. They evaluate image quality and diversity relative to the reference to estimate the diversity metric. For other metrics, users assess single images.
  • Figure 4: Enhancing Generation Diversity. Our method demonstrates superior image quality, producing clearer and more detailed results across different latent variables. In contrast, existing approaches show significant degradation in image quality, with blurring and distortions becoming evident as latent variations increase.
  • Figure 5: Ablation study for $\gamma$ and $\beta$. This study investigates the impact of $\gamma$ and $\beta$ on image quality, using PickScore as the metric. In the first ablation, we vary $\gamma$ while keeping all other parameters fixed, with $\beta$ set to 0.1. In the second ablation, we explore the influence of different values of $\beta$ for the $\gamma=3$ setting.
  • ...and 2 more figures