Robust Preference Optimization via Dynamic Target Margins

Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, Xiang Wang

TL;DR

This work tackles robustness in aligning LLMs under noisy human preferences by introducing γ-PO, a dynamic target-margin method that assigns per-instance margins to preference pairs. By formulating an adaptive, KL-regularized margin optimization, γ-PO integrates with DPO and SimPO as γ-DPO and γ-SimPO, effectively implementing adaptive label smoothing tied to reward gaps. Empirical results across multiple base models and benchmarks show a 4.4% average improvement over baselines with minimal training overhead, validating the method’s plug-and-play practicality. The approach addresses data quality issues in RLHF pipelines and offers a scalable path for more robust, human-aligned LLM behavior.

Abstract

The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO depends heavily on data quality, which is frequently compromised by noise. In this work, we propose $γ$-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, $γ$-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, $γ$-PO is a plug-and-play method, compatible with variants of DPO that rely on the reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, $γ$-PO achieves an average 4.4% improvement over other baselines, setting a new state of the art. Additionally, $γ$-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our code is available at https://github.com/sunjie279/gammaPO.
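
To make the idea of a per-pair target margin concrete, here is a minimal sketch in plain Python. The loss has the usual margin form, $-\log\sigma(r_w - r_l - \gamma_i)$. The abstract does not say how $\gamma_i$ is computed, so the rule in `softmax_margins` is an assumption for illustration only: it spreads a total budget of $n \cdot \gamma_0$ over the batch in proportion to a softmax of the reward gaps, which is the form a KL-regularized linear objective over the simplex takes at its maximum. The function names and the temperature `tau` are hypothetical and are not taken from the authors' repository.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def margin_loss(r_w, r_l, gammas):
    """Mean of -log sigmoid((r_w - r_l) - gamma_i) over a batch of pairs."""
    losses = [-log_sigmoid((w - l) - g) for w, l, g in zip(r_w, r_l, gammas)]
    return sum(losses) / len(losses)


def softmax_margins(r_w, r_l, gamma_0, tau=1.0):
    """Illustrative assumption, not the paper's formula.

    Redistributes a total margin budget of n * gamma_0 across the batch
    in proportion to softmax((r_w - r_l) / tau), so pairs with a larger
    reward gap receive a larger target margin.
    """
    gaps = [w - l for w, l in zip(r_w, r_l)]
    top = max(gaps)  # subtract the max before exponentiating for stability
    weights = [math.exp((m - top) / tau) for m in gaps]
    total = sum(weights)
    n = len(gaps)
    return [gamma_0 * n * wt / total for wt in weights]


if __name__ == "__main__":
    r_w, r_l = [2.0, 0.1], [0.0, 0.0]   # one clear pair, one ambiguous pair
    static = [0.5, 0.5]                 # the same gamma_0 for every pair
    dynamic = softmax_margins(r_w, r_l, gamma_0=0.5)
    print("dynamic margins:", dynamic)
    print("static loss:", margin_loss(r_w, r_l, static))
    print("dynamic loss:", margin_loss(r_w, r_l, dynamic))
```

Under this rule the mean margin stays at $\gamma_0$, so the dynamic variant only changes how the margin is distributed across pairs. In a real training loop the margins would normally be treated as constants when differentiating the policy loss; that detail is left out of this sketch.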

Paper Structure

This paper contains 27 sections, 2 theorems, 25 equations, 6 figures, 10 tables, 2 algorithms.

Key Result

Theorem 3.1

Let $\delta = \gamma_i - \gamma_0$ and $m = r_w - r_l$. When $|\delta| \ll |m|$, equating $\mathcal{L}_{\text{rDPO}}$ and $\mathcal{L}_{\gamma\text{-PO}}$ yields the approximation:

Figures (6)

  • Figure 1: Comparison of ambiguous and unambiguous sample pairs. Ambiguous pairs exhibit narrow reward margins, indicating low confidence in model predictions, whereas unambiguous pairs demonstrate wide reward margins, reflecting high prediction confidence.
  • Figure 2: Distribution of the reward margin for the Mistral model using the SimPO objective on the Ultrafeedback Binarized dataset. The blue violin plot represents the full range of the original distribution, while the brown plot provides a zoomed-in view of the distribution, focusing on the central values close to zero.
  • Figure 3: The Dynamic Target Margin Module adjusts the per-instance target margin ($\gamma_i$) through reward-driven optimization, guided by the dual mechanisms described in Section \ref{sec:guidances}. The optimized $\gamma_i$ then replaces the static margin ($\gamma_0$) in the policy optimization loss, so the margin adapts throughout training.
  • Figure 4: Comparison of average winning rates with randomly flipped labels at different probabilities.
  • Figure 5: Visualization of dynamic target margin ($\gamma_i$) with reward gaps. The horizontal line indicates the initial value of target margin ($\gamma_0$).
  • ...and 1 more figure
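
The label-noise setting in Figure 4 can be sketched as follows. This is a minimal illustration of flipping preference labels at a given probability, assuming each example is a `(chosen, rejected)` tuple; the function name, the data layout, and the seeding scheme are hypothetical and do not come from the paper's code.

```python
import random


def flip_preferences(pairs, p, seed=0):
    """Swap chosen and rejected responses independently with probability p.

    `pairs` is a list of (chosen, rejected) tuples. A fixed seed keeps the
    injected noise reproducible across runs.
    """
    rng = random.Random(seed)
    noisy = []
    for chosen, rejected in pairs:
        if rng.random() < p:
            noisy.append((rejected, chosen))  # flipped label
        else:
            noisy.append((chosen, rejected))  # label kept as-is
    return noisy


if __name__ == "__main__":
    data = [("good answer %d" % i, "bad answer %d" % i) for i in range(1000)]
    for p in (0.0, 0.1, 0.2, 0.4):
        noisy = flip_preferences(data, p)
        flipped = sum(a != b for a, b in zip(data, noisy))
        print(f"p={p}: {flipped} of {len(data)} pairs flipped")
```

Training on the output of `flip_preferences` at several values of `p` and comparing win rates is one way to reproduce the kind of robustness comparison the caption describes.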

Theorems & Definitions (3)

  • Theorem 3.1
  • Theorem 3.1
  • proof