The Crucial Role of Samplers in Online Direct Preference Optimization

Ruizhe Shi, Runlong Zhou, Simon S. Du

TL;DR

This paper analyzes how sampler design influences the convergence of Direct Preference Optimization (DPO) for language model alignment. It proves a theoretical separation: under exact gradient updates, mixed samplers (DPO-Mix-R and DPO-Mix-P) achieve quadratic convergence, while uniform sampling (DPO-Unif) yields linear convergence; in empirical settings, convergence to the noise floor remains linear. To bridge theory and practice, the authors introduce posterior-based sampling and logit mixing, mapping existing DPO variants to their framework and demonstrating practical gains on Safe-RLHF and Iterative-Prompt datasets. The work offers both a deeper optimization-theoretic understanding of DPO and concrete sampler-design principles that enhance online RLHF-style alignment, guiding future algorithm development. Overall, the results highlight the critical role of sampler design in achieving fast, stable, and scalable alignment of language models with human preferences.

Abstract

Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, its optimization properties, particularly the impact of samplers on its convergence rates, remain under-explored. In this paper, we provide a rigorous analysis of DPO's convergence rates with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves linear convergence, while our proposed online sampler achieves quadratic convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and logit mixing, demonstrating improvements over previous methods. For example, it outperforms vanilla DPO by over 7.4% on the Safe-RLHF dataset. Our results not only offer insights into the theoretical understanding of DPO but also pave the way for further algorithm designs.
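
For readers less familiar with the terminology, the standard optimization-theoretic meaning of the two rates can be written as below. Here $e_t \ge 0$ is a generic error measure after $t$ iterations and $c$, $C$ are constants; these symbols are illustrative assumptions and are not the paper's notation.

```latex
\begin{align*}
\text{linear:}    \quad & e_{t+1} \le c\, e_t, \quad c \in (0,1)
                  && \Longrightarrow \quad e_T \le c^{T} e_0, \\
\text{quadratic:} \quad & e_{t+1} \le C\, e_t^{2}
                  && \Longrightarrow \quad C e_T \le (C e_0)^{2^{T}}.
\end{align*}
```

When $C e_0 < 1$, reaching accuracy $\varepsilon$ therefore takes on the order of $\log(1/\varepsilon)$ iterations at a linear rate but only on the order of $\log\log(1/\varepsilon)$ at a quadratic rate.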

Paper Structure

This paper contains 56 sections, 8 theorems, 93 equations, 6 figures, and 5 tables.

Key Result

Theorem 1

Under the theorem's condition, DPO-Unif converges linearly in the number of iterations $T\in\mathbb N$.

Figures (6)

  • Figure 1: Contextual bandit experiments for exact DPO and empirical DPO. The $x$-axis is the number of gradient updates, and the $y$-axis is the total parameter difference $\sum_{y, y'} \vert\delta (y, y'; \theta^{(t)})\vert$. The left figure illustrates exact DPO, and the right figure illustrates empirical DPO. For exact DPO, the lower bound is due to floating-point precision. For empirical DPO, the lower bound is due to sampling variance. The separation is clear in exact DPO and still exists in empirical DPO.
  • Figure 2: The Reward-KL curves. The left figure illustrates results on Safe-RLHF, and the right one illustrates results on Iterative-Prompt. The KL divergence is measured on a subset of prompts in the test set. The results indicate that the KL divergence of the trained models does not deviate much from the reference model, and our method performs best in balancing reward and KL divergence.
  • Figure 3: More bandit experiments for exact DPO.
  • Figure 4: More bandit experiments for empirical DPO.
  • Figure 5: Ablation on components of mixed samplers for DPO-Mix-R.
  • ...and 1 more figure
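
Figure 1's caption attributes the plateau in exact DPO to floating-point precision. The toy sketch below, which is not the paper's experiment, suggests why a quadratically convergent recursion reaches such a floor in far fewer updates than a linearly convergent one. The two recursions, the starting error 0.5, the contraction factor 0.5, and the floor 1e-15 are all hypothetical values chosen for illustration.

```python
def iterations_to_floor(step, e0, floor, max_iters=10_000):
    """Count updates until the error is at or below `floor`."""
    error, count = e0, 0
    while error > floor and count < max_iters:
        error = step(error)
        count += 1
    return count


# Hypothetical recursions: e <- 0.5 * e (linear) and e <- e * e (quadratic).
linear_iters = iterations_to_floor(lambda e: 0.5 * e, e0=0.5, floor=1e-15)
quadratic_iters = iterations_to_floor(lambda e: e * e, e0=0.5, floor=1e-15)

print("linear:", linear_iters, "quadratic:", quadratic_iters)
```

With these illustrative constants the linear recursion needs 49 updates and the quadratic one needs 6.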

Theorems & Definitions (19)

  • Definition 1: Exact DPO
  • Remark 1
  • Definition 2: Empirical DPO
  • Remark 2
  • Definition 3: Uniform sampler
  • Definition 4: Reward-guided mixed sampler
  • Remark 3
  • Definition 5: Policy-difference-guided mixed sampler
  • Remark 4
  • Theorem 1: Upper bound of DPO-Unif
  • ...and 9 more
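
The list above names Exact DPO and a uniform sampler without reproducing their definitions. As a rough illustration only, the sketch below runs exact-gradient descent on the standard DPO logistic loss for a small bandit, weighting all ordered action pairs equally. It assumes a tabular softmax policy with logits `theta`, a uniform reference policy, Bradley-Terry preferences driven by `rewards`, and a hypothetical convergence measure `pairwise_gap` standing in for the parameter-difference metric of Figure 1. None of these names or constants come from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_gap(theta, rewards, beta):
    """Assumed metric: sum over pairs of |(r_y - r_y') - beta * (theta_y - theta_y')|."""
    n = len(theta)
    return sum(
        abs((rewards[i] - rewards[j]) - beta * (theta[i] - theta[j]))
        for i in range(n)
        for j in range(n)
    )


def exact_uniform_step(theta, rewards, beta, lr):
    """One exact-gradient step with equal weight 1/n^2 on every ordered pair."""
    n = len(theta)
    grad = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            p_true = sigmoid(rewards[i] - rewards[j])         # Bradley-Terry preference
            p_model = sigmoid(beta * (theta[i] - theta[j]))   # policy-implied preference
            g = -beta * (p_true - p_model) / (n * n)          # d(loss)/d(theta_i)
            grad[i] += g
            grad[j] -= g
    return [t - lr * g for t, g in zip(theta, grad)]


# Hypothetical three-action bandit; all numbers are illustrative.
rewards = [1.0, 0.0, -0.5]
beta, lr = 1.0, 1.0
theta = [0.0, 0.0, 0.0]

initial_gap = pairwise_gap(theta, rewards, beta)
for _ in range(300):
    theta = exact_uniform_step(theta, rewards, beta, lr)
final_gap = pairwise_gap(theta, rewards, beta)

print(initial_gap, final_gap)
```

In this toy setting the loss is minimized when `beta * (theta[i] - theta[j])` equals `rewards[i] - rewards[j]` for every pair, so the gap should shrink toward zero as the updates proceed.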