Understanding the Impact of Sampling Quality in Direct Preference Optimization
Kyung Rok Kim, Yumo Bai, Chonghuan Wang, Guanting Chen
TL;DR
This work investigates how data quality and sampling strategies shape Direct Preference Optimization (DPO) training dynamics, revealing how the support of the data-generating distribution constrains learning and can cause likelihood displacement. It establishes that the DPO optimum coincides with a reward-based reweighting of the reference policy, but that practical learning depends on the observed data support and the quality of the reference model. The authors propose an online linear alignment model to isolate the core effects, proving convergence and showing that more frequent high-reward responses amplify the gradient signal and improve convergence. Empirical results with on-policy data rewrites and Best-of-$K$ sampling corroborate the theory, demonstrating improved in-distribution (ID) and out-of-distribution (OOD) performance and no reward hacking under the proposed framework. Collectively, the results justify online, data-aware DPO and provide a principled foundation for data design in preference-based alignment training.
Abstract
We study how higher-quality data can be leveraged to improve performance in Direct Preference Optimization (DPO), aiming to understand the impact of data quality on DPO training dynamics. Our analyses show that both the solution space and the convergence behavior of DPO depend on the support and quality of the data-generating distribution. We first analyze how data and reference policy influence policy updates during gradient descent, and how a practical phenomenon known as likelihood displacement can interfere with the desired dynamics. We then design a simplified yet well-structured alignment model as a proxy that preserves most of the beneficial properties of RLHF while avoiding likelihood displacement. Based on this model, we develop quantitative results showing how more frequent high-quality responses amplify the gradient signal and improve the optimization landscape, leading to more effective policy learning. Our theoretical findings are supported by empirical experiments and provide a principled justification for the online DPO framework in practice.
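The dependence on support and sampling quality can be illustrated with a minimal sketch. It assumes the standard closed form for the KL-regularized objective underlying DPO, $\pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\exp(r(x, y)/\beta)$, on a toy set of four responses to a single prompt. The probabilities, rewards, $\beta$, and the helper names `reweighted_policy` and `best_of_k` are hypothetical choices for illustration, not the paper's experimental setup.

```python
import math
import random


def reweighted_policy(pi_ref, rewards, beta):
    """Return pi*(y) proportional to pi_ref(y) * exp(r(y) / beta) on a finite response set."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)  # normalizing constant (partition function)
    return [w / z for w in weights]


def best_of_k(pi_ref, rewards, k, rng):
    """Draw k responses from pi_ref and return the index of the highest-reward draw."""
    draws = rng.choices(range(len(pi_ref)), weights=pi_ref, k=k)
    return max(draws, key=lambda i: rewards[i])


# Hypothetical toy setup: four candidate responses to one prompt.
pi_ref = [0.5, 0.3, 0.2, 0.0]   # response 3 lies outside the reference support
rewards = [0.0, 1.0, 2.0, 5.0]  # ...although it has the highest reward
beta = 1.0

# The reweighted optimum cannot place mass where pi_ref has none.
pi_star = reweighted_policy(pi_ref, rewards, beta)

# Best-of-K raises how often the best in-support response (index 2) is sampled.
rng = random.Random(0)
n = 5000
freq_k1 = sum(best_of_k(pi_ref, rewards, 1, rng) == 2 for _ in range(n)) / n
freq_k4 = sum(best_of_k(pi_ref, rewards, 4, rng) == 2 for _ in range(n)) / n
```

In this toy setting the reweighted policy assigns zero probability to the highest-reward response because the reference policy never produces it, while Best-of-$K$ sampling increases the frequency of the best in-support response from its base rate of $0.2$ toward $1 - 0.8^4 \approx 0.59$ for $K = 4$.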
