Mitigating Mismatch within Reference-based Preference Optimization

Suqin Yuan, Xingrui Yu, Jiyang Zheng, Lei Feng, Dadong Wang, Ivor Tsang, Tongliang Liu

TL;DR

This work tackles the training–inference mismatch in Direct Preference Optimization (DPO), which arises from reliance on a reference policy and can prematurely attenuate gradients on pessimistic pairs. It introduces Hybrid-DPO (HyPO), a drop-in modification that conditionally clips the reference margin with $\tilde{\Delta}_{\mathrm{ref}}=\max\{0,\Delta_{\mathrm{ref}}\}$ (or a softplus variant), preserving DPO’s structure when the reference is helpful and switching to an absolute-margin update when the reference is pessimistic. Empirically, HyPO improves both base and instruction-tuned LLMs on the AlpacaEval 2.0 and Arena-Hard benchmarks, achieving a 41.2% average relative improvement over DPO and remaining robust across model scales and dataset shifts. The results suggest that conditionally debiasing the reference signal is a principled, practical path to stronger, more stable direct preference alignment at no extra computational cost.
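To make the clipping rule concrete, the margin that enters the loss can be written case by case. This is a sketch under stated assumptions: $\Delta_\theta$ denotes the policy margin, and the loss is taken to have the standard DPO form $-\log\sigma(\beta\,\cdot)$ with a temperature $\beta>0$. Neither the loss form nor $\beta$ is spelled out in this summary.

$$
\Delta_\theta-\tilde{\Delta}_{\mathrm{ref}}
=\Delta_\theta-\max\{0,\Delta_{\mathrm{ref}}\}
=\begin{cases}
\Delta_\theta-\Delta_{\mathrm{ref}}, & \Delta_{\mathrm{ref}}\ge 0 \quad\text{(same as DPO)}\\
\Delta_\theta, & \Delta_{\mathrm{ref}}<0 \quad\text{(absolute margin)}
\end{cases}
$$

Under the assumed loss, the per-example gradient weight has magnitude $\beta\,\sigma\!\big(-\beta(\Delta_\theta-\tilde{\Delta}_{\mathrm{ref}})\big)$. When $\Delta_{\mathrm{ref}}<0$ we have $\Delta_\theta<\Delta_\theta-\Delta_{\mathrm{ref}}$, and $\sigma(-\beta\,\cdot)$ is decreasing, so the weight on a pessimistic pair is strictly larger than DPO's and unchanged everywhere else.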

Abstract

Direct Preference Optimization (DPO) has become the de facto standard for offline preference alignment of large language models, but its reliance on a reference policy introduces a critical tension. DPO weighs each update relative to a reference, which stabilizes training by regularizing the updates within a trusted region. This reliance becomes problematic for pessimistic pairs, where the reference model prefers the rejected response. For these pairs, DPO prematurely attenuates the gradient as soon as the policy margin ($\Delta_\theta$) merely beats the reference margin ($\Delta_{\mathrm{ref}}$), even if the policy is still wrong ($\Delta_\theta<0$). We name this failure premature satisfaction; it is a concrete form of the training-inference mismatch. Reference-free objectives remove this mismatch by optimizing the absolute margin, but at the cost of discarding the stabilizing signal of the reference. We mitigate this tension with Hybrid-DPO (HyPO), a drop-in modification to DPO that applies the reference conditionally: HyPO behaves exactly like DPO when the reference is optimistic or neutral, and it treats the reference as neutral when it is pessimistic by replacing $\Delta_\theta-\Delta_{\mathrm{ref}}$ with $\Delta_\theta-\max\{0,\Delta_{\mathrm{ref}}\}$. This one-line change strictly strengthens per-example learning signals on pessimistic pairs while preserving DPO's objective form and computational cost. By conditionally debiasing the pessimistic reference signal, HyPO mitigates premature satisfaction; empirically, across preference alignment benchmarks, HyPO improves inference-aligned metrics and achieves higher pairwise win rates. Our results provide evidence that direct preference alignment can be enhanced by conditionally debiasing the reference signal, rather than discarding it.
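The "one-line change" can be illustrated with a small per-example sketch in plain Python. It is not the authors' implementation: it assumes the standard DPO loss $-\log\sigma(\beta\,\cdot)$, and the function names, the `hybrid` flag, and the value of `beta` are hypothetical choices made for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def effective_margin(delta_theta: float, delta_ref: float, hybrid: bool) -> float:
    """Margin fed to the loss.

    hybrid=False: DPO's relative margin, delta_theta - delta_ref.
    hybrid=True:  the clipped variant, delta_theta - max(0, delta_ref).
    """
    ref = max(0.0, delta_ref) if hybrid else delta_ref  # the one-line change
    return delta_theta - ref


def loss(delta_theta: float, delta_ref: float, beta: float = 0.1,
         hybrid: bool = False) -> float:
    """-log(sigmoid(beta * margin)), the assumed standard DPO loss form."""
    u = beta * effective_margin(delta_theta, delta_ref, hybrid)
    # -log(sigmoid(u)) == log(1 + exp(-u)), evaluated in a stable way.
    return max(-u, 0.0) + math.log1p(math.exp(-abs(u)))


def grad_weight(delta_theta: float, delta_ref: float, beta: float = 0.1,
                hybrid: bool = False) -> float:
    """Magnitude of d(loss)/d(delta_theta): beta * sigmoid(-beta * margin)."""
    return beta * sigmoid(-beta * effective_margin(delta_theta, delta_ref, hybrid))


# Pessimistic pair: the reference prefers the rejected response (delta_ref < 0)
# and the policy is still wrong in absolute terms (delta_theta < 0).
w_dpo = grad_weight(-1.0, -5.0, beta=1.0, hybrid=False)   # about 0.018
w_hypo = grad_weight(-1.0, -5.0, beta=1.0, hybrid=True)   # about 0.731
```

With these illustrative numbers, the policy margin already beats the reference margin, so the relative-margin weight has nearly vanished even though the policy still ranks the pair incorrectly; the clipped variant keeps a strong signal. For any pair with a non-negative reference margin the two variants return identical values.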

Paper Structure (23 sections, 20 equations, 5 figures, 6 tables)


Figures (5)

  • Figure 1: Per-example gradient weight heatmaps. The weight, plotted over the policy margin $\Delta_\theta$ ($x$-axis) and the reference margin $\Delta_{\mathrm{ref}}$ ($y$-axis), indicates training signal strength. DPO’s reliance on the relative margin leads to premature satisfaction: on pessimistic examples (blue dot, $\Delta_{\mathrm{ref}}<0$), the signal is heavily attenuated even though the policy is still wrong in absolute terms. Reference-free methods provide a strong signal but discard the reference entirely. HyPO mitigates this by mirroring the reference-free behavior on pessimistic examples to ensure a strong signal, while reverting to DPO on optimistic examples (red dot) to maintain proximity to the reference policy.
  • Figure 2: Distribution of the reference margin ($\Delta_{\mathrm{ref}}$) across different reference models. The table reports the mean and median (p50) of $\Delta_{\mathrm{ref}}$ for each model.
  • Figure 2: Ablation study of HyPO's components.
  • Figure 3: HyPO improves inference-aligned evaluation metrics and pairwise win rates. (a) Absolute agreement rate over training (higher is better). (b) Absolute margin on the pessimistic subset ($\Delta_{\mathrm{ref}}<0$). (c) Pairwise win rate: each cell is the win rate (%) of the row model against the column model on AlpacaEval 2.0 [alpaca_eval]. All results use the same SFT checkpoint of Llama-3-8B-Base [llama3modelcard] trained on UltraFeedback [cui2023ultrafeedback] with either DPO or HyPO; the training/evaluation pipeline and optimization hyperparameters are identical and set to the DPO configuration from Zephyr [tunstall2023zephyr]. See the experimental settings section for more details.
  • Figure 4: Sensitivity to the threshold $\gamma$, using Meta-Llama-3-8B-Instruct.