Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models

Minghao Fu, Guo-Hua Wang, Tianyu Cui, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang

TL;DR

Diffusion-SDPO tackles the instability and potential quality drop seen when increasing the preference margin in diffusion-based DPO. It introduces a winner-preserving update that adaptively scales the loser gradient using a first-order safeguard, expressed in both parameter and output spaces, to guarantee the preferred output's loss does not increase. The approach is model-agnostic and acts as a plug-in to existing DPO variants, yielding consistent improvements across SD 1.5, SDXL, and Ovis-U1 on automated reward, aesthetic, and alignment metrics with minimal overhead. By clarifying the distinction between relative preference alignment and absolute generation quality, the work offers a practical, scalable path to robust human-aligned diffusion synthesis.

Abstract

Text-to-image diffusion models deliver high-quality images, yet aligning them with human preferences remains challenging. We revisit diffusion-based Direct Preference Optimization (DPO) for these models and identify a critical pathology: enlarging the preference margin does not necessarily improve generation quality. In particular, the standard Diffusion-DPO objective can increase the reconstruction error of both winner and loser branches. Consequently, degradation of the less-preferred outputs can become sufficiently severe that the preferred branch is also adversely affected even as the margin grows. To address this, we introduce Diffusion-SDPO, a safeguarded update rule that preserves the winner by adaptively scaling the loser gradient according to its alignment with the winner gradient. A first-order analysis yields a closed-form scaling coefficient that guarantees the error of the preferred output is non-increasing at each optimization step. Our method is simple, model-agnostic, and broadly compatible with existing DPO-style alignment frameworks, and it adds only marginal computational overhead. Across standard text-to-image benchmarks, Diffusion-SDPO delivers consistent gains over preference-learning baselines on automated preference, aesthetic, and prompt alignment metrics. Code is publicly available at https://github.com/AIDC-AI/Diffusion-SDPO.
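
The safeguard described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on simplifying assumptions: the update direction is taken to be d = g_w - lambda * g_l, where g_w and g_l are the gradients of the winner and loser reconstruction losses, and the step is theta <- theta - eta * d. Under that form, the first-order change in the winner loss is -eta * (||g_w||^2 - lambda * <g_w, g_l>), which is non-positive whenever lambda <= ||g_w||^2 / <g_w, g_l> (and for any lambda >= 0 when the inner product is non-positive). The function names, the clipping of lambda to [0, 1], and the use of mu as a multiplicative safety factor are hypothetical choices made for illustration; the coefficient in the paper may differ.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally sized flat gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def safe_lambda(g_w: Sequence[float], g_l: Sequence[float],
                mu: float = 1.0, eps: float = 1e-12) -> float:
    """Scale for the loser gradient under the assumed update d = g_w - lam * g_l.

    Hypothetical helper: mu in (0, 1] is treated here as a safety factor,
    and lam is clipped to [0, 1] so the guard never amplifies the loser term.
    """
    overlap = dot(g_w, g_l)
    if overlap <= eps:
        # The loser term does not push the winner loss up: no scaling needed.
        return 1.0
    bound = mu * dot(g_w, g_w) / overlap
    return max(0.0, min(1.0, bound))


def safeguarded_direction(g_w: Sequence[float], g_l: Sequence[float],
                          mu: float = 1.0) -> Tuple[List[float], float]:
    """Return the descent direction d = g_w - lam * g_l and the lam used."""
    lam = safe_lambda(g_w, g_l, mu)
    return [gw - lam * gl for gw, gl in zip(g_w, g_l)], lam


if __name__ == "__main__":
    g_w, g_l = [1.0, 0.0], [2.0, 1.0]
    # Unguarded direction (lam = 1): <g_w, d> < 0, so the winner loss rises.
    unguarded = [gw - gl for gw, gl in zip(g_w, g_l)]
    print("unguarded <g_w, d> =", dot(g_w, unguarded))
    # Guarded direction: <g_w, d> >= 0, so the winner loss does not rise
    # to first order.
    d, lam = safeguarded_direction(g_w, g_l)
    print("lam =", lam, " guarded <g_w, d> =", dot(g_w, d))
```

In this toy case the two gradients overlap strongly, so the unguarded step would raise the winner loss to first order, while the scaled step leaves it unchanged. When the gradients are orthogonal or opposed, the sketch keeps lambda at 1 and the update reduces to the unguarded direction.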

Paper Structure

This paper contains 31 sections, 25 equations, 9 figures, 9 tables, 1 algorithm.

Figures (9)

  • Figure 1: Training dynamics of preference losses during DPO finetuning without (left) and with (right) our safe-$\lambda$ mechanism on SD 1.5. Images beneath the plots illustrate samples generated at training steps $\{0,500,1000,1500,2000\}$.
  • Figure 2: Training dynamics of $\lambda_{\text{safe}}$ on SD 1.5 (left) and Ovis-U1 (right) with two computation schemes (using output-space gradients vs. parameter-space gradients). The trajectories closely match throughout training, and the output-space variant requires substantially less computation while maintaining comparable aesthetic rewards (see Table \ref{tab:unified_wpr}).
  • Figure 3: Training dynamics across three objectives with and without SDPO on SD 1.5.
  • Figure 4: Sensitivity of SDPO to the hyperparameter $\mu$, measured by HPS V2 and PickScore across SD 1.5 and SDXL on the HPS V2 prompt set.
  • Figure 5: Qualitative comparison of different methods using SD 1.5. Prompts: 1) The Little Prince and the fox in a Tim Burton style artwork. 2) A futuristic modern house on a floating rock island surrounded by waterfalls, moons, and stars on an alien planet. See Fig. \ref{fig:images_sd15_appendix} for more results.
  • ...and 4 more figures