The Crucial Role of Samplers in Online Direct Preference Optimization
Ruizhe Shi, Runlong Zhou, Simon S. Du
TL;DR
This paper analyzes how sampler design influences the convergence of Direct Preference Optimization (DPO) for language model alignment. It proves a theoretical separation: under exact gradient updates, mixed samplers (DPO-Mix-R and DPO-Mix-P) achieve quadratic convergence, while uniform sampling (DPO-Unif) yields linear convergence; in empirical settings, convergence to the noise floor remains linear. To bridge theory and practice, the authors introduce posterior-based sampling and logit mixing, mapping existing DPO variants to their framework and demonstrating practical gains on Safe-RLHF and Iterative-Prompt datasets. The work offers both a deeper optimization-theoretic understanding of DPO and concrete sampler-design principles that enhance online RLHF-style alignment, guiding future algorithm development. Overall, the results highlight the critical role of sampler design in achieving fast, stable, and scalable alignment of language models with human preferences.
Abstract
Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, its optimization properties, particularly the impact of samplers on its convergence rates, remain under-explored. In this paper, we provide a rigorous analysis of DPO's convergence rates with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves $\textbf{linear}$ convergence, while our proposed online sampler achieves $\textbf{quadratic}$ convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and logit mixing, demonstrating improvements over previous methods. For example, it outperforms vanilla DPO by over $7.4\%$ on the Safe-RLHF dataset. Our results not only offer insights into the theoretical understanding of DPO but also pave the way for further algorithm designs.
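
To make the gap between linear and quadratic convergence concrete, the following minimal sketch compares two generic error recursions: a linear one, e_{k+1} = c * e_k, and a quadratic one, e_{k+1} = e_k^2. It is an illustration of the two rate classes only, not the DPO update or any sampler from the paper; the function name `iterations_to_tolerance`, the contraction factor 0.5, the starting error 0.5, and the tolerance 1e-12 are all assumptions chosen for the example.

```python
from typing import Callable


def iterations_to_tolerance(
    step: Callable[[float], float],
    e0: float = 0.5,
    tol: float = 1e-12,
    max_iter: int = 10_000,
) -> int:
    """Count how many applications of `step` bring the error from e0 to <= tol."""
    error, k = e0, 0
    while error > tol and k < max_iter:
        error = step(error)
        k += 1
    return k


# Linear convergence: the error shrinks by a constant factor each iteration.
# The factor 0.5 is an illustrative assumption, not a value from the paper.
linear_iters = iterations_to_tolerance(lambda e: 0.5 * e)

# Quadratic convergence: the error is squared each iteration, so the number
# of correct digits roughly doubles per step once the error is below 1.
quadratic_iters = iterations_to_tolerance(lambda e: e * e)

print(f"linear:    {linear_iters} iterations")
print(f"quadratic: {quadratic_iters} iterations")
```

Under these assumed constants the quadratic recursion reaches the tolerance in far fewer iterations than the linear one, which is the qualitative behavior the separation result describes for the exact gradient setting.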
