Understanding the Impact of Sampling Quality in Direct Preference Optimization

Kyung Rok Kim, Yumo Bai, Chonghuan Wang, Guanting Chen

TL;DR

This work investigates how data quality and sampling strategies shape Direct Preference Optimization (DPO) training dynamics, revealing how the data-generating distribution’s support constrains learning and can cause likelihood displacement. It establishes that the DPO optimum coincides with a reward-weighted reweighting of the reference policy, but practical learning depends on the observed data support and reference model quality. The authors propose an online linear alignment model to isolate core effects, proving convergence and showing that higher-quality, more frequent high-reward responses amplify gradient signals and improve convergence. Empirical results with on-policy data rewrites and Best-of-$K$ sampling corroborate the theory, demonstrating improved ID/OOD performance and no reward hacking under the proposed framework. Collectively, the results justify online, data-aware DPO and provide a principled foundation for data design in preference-based training of alignment systems.

Abstract

We study how data of higher quality can be leveraged to improve performance in Direct Preference Optimization (DPO), aiming to understand its impact on DPO training dynamics. Our analyses show that both the solution space and the convergence behavior of DPO depend on the support and quality of the data-generating distribution. We first analyze how data and reference policy influence policy updates during gradient descent, and how a practical phenomenon known as likelihood displacement can interfere with the desired dynamics. We then design a simplified yet well-structured alignment model as a proxy that preserves most of the beneficial properties of RLHF while avoiding likelihood displacement. Based on this model, we develop quantitative results showing how more frequent high-quality responses amplify the gradient signal and improve the optimization landscape, leading to more effective policy learning. Our theoretical findings are supported by empirical experiments and provide a principled justification for the online DPO framework in practice.

Paper Structure

This paper contains 58 sections, 12 theorems, 125 equations, 9 figures, 3 tables, 2 algorithms.

Key Result

Proposition 1

The function $f^*(x,y) = r^*(x,y) + c(x)$, where $c(x)$ is a function of $x$ only, is a globally optimal solution to (DPO_form_f). Consequently, the policy $\pi^*(y|x)$ defined as

$$\pi^*(y|x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}r^*(x,y)\right),$$

where $Z(x) = \int \pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}r^*(x,y)\right)dy$, is an optimal solution for both the RLHF formulation (RLHF_original) and the DPO formulation (DPO_BT_definition).
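As a rough illustration of Proposition 1, the sketch below evaluates $\pi^*$ on a small finite response set, replacing the integral in $Z(x)$ with a sum. The function name `optimal_policy` and the numbers in `pi_ref` and `rewards` are made up for this example and do not come from the paper. It shows two consequences of the formula: adding a constant $c(x)$ to every reward leaves $\pi^*$ unchanged, and a response with zero probability under $\pi_{\text{ref}}$ keeps zero probability under $\pi^*$, however high its reward.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """Return pi*(y|x) proportional to pi_ref(y|x) * exp(r*(x,y) / beta)
    over a finite response set (hypothetical helper, not from the paper)."""
    # Subtract the max reward for numerical stability; the shift cancels
    # when dividing by the normalizer.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)  # discrete analogue of Z(x), up to the cancelled shift
    return [w / z for w in weights]

# Illustrative numbers only: four candidate responses for one prompt x.
pi_ref = [0.5, 0.3, 0.2, 0.0]   # the last response lies outside the support of pi_ref
rewards = [0.0, 1.0, 2.0, 5.0]  # the highest reward sits on the unsupported response

pi_star = optimal_policy(pi_ref, rewards, beta=1.0)

# Adding a prompt-only constant c(x) = 3 to every reward gives the same policy.
shifted = optimal_policy(pi_ref, [r + 3.0 for r in rewards], beta=1.0)

print(pi_star)
print(shifted)
```

Among the supported responses, probability mass moves toward the higher-reward ones, while the unsupported response stays at zero. This is the discrete counterpart of the summary's point that the support of the reference policy constrains what the optimum can express.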

Figures (9)

  • Figure 1: Effect of Misaligned Supports
  • Figure 2: $y_{\text{ind}}$ refers to the response that is not semantically related to $y_w$ or $y_l$, and $y_{\text{cor}}$ refers to the response semantically correlated with $y_w$. The left diagram shows the change of probability after applying one step of gradient descent on the entire batch of data set $\{(x^{(i)},y_w^{(i)},y_l^{(i)})\}_{i=1}^N$. The right diagram shows the change of probability after applying one step of gradient descent on the entire batch of data set $\{(x^{(i)},y_w^{(i)},y_l^{(i)})\}_{i=1}^N$.
  • Figure 3: $y_{\text{ind}}$ refers to the response that is significantly different from $y_w$ or $y_l$, and $y_{\text{cor}}$ refers to the vector close to $y_w$. The left diagram shows the change of probability after repeatedly applying one step of gradient descent on one tuple $(x,y_w,y_l)$ from the entire dataset. The right diagram shows the change of probability after applying one step of gradient descent on the entire batch of data set $\{(x^{(i)},y_w^{(i)},y_l^{(i)})\}_{i=1}^N$.
  • Figure 4: Convergence of Algorithm \ref{alg:Data} to the ground-truth weight in the linear alignment model.
  • Figure 5: Sampled responses' reward distribution for different models (ID and OOD).
  • ...and 4 more figures

Theorems & Definitions (27)

  • Proposition 1
  • Lemma 1
  • Theorem 1
  • Lemma 2
  • Proposition 2: Convergence of Online DPO
  • Lemma 3
  • Proposition 3: First and second order properties
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • ...and 17 more