The Crucial Role of Samplers in Online Direct Preference Optimization

Ruizhe Shi; Runlong Zhou; Simon S. Du

TL;DR

This paper analyzes how sampler design influences the convergence of Direct Preference Optimization (DPO) for language model alignment. It proves a theoretical separation: under exact gradient updates, mixed samplers (DPO-Mix-R and DPO-Mix-P) achieve quadratic convergence, while uniform sampling (DPO-Unif) yields linear convergence; in empirical settings, convergence to the noise floor remains linear. To bridge theory and practice, the authors introduce posterior-based sampling and logit mixing, mapping existing DPO variants to their framework and demonstrating practical gains on Safe-RLHF and Iterative-Prompt datasets. The work offers both a deeper optimization-theoretic understanding of DPO and concrete sampler-design principles that enhance online RLHF-style alignment, guiding future algorithm development. Overall, the results highlight the critical role of sampler design in achieving fast, stable, and scalable alignment of language models with human preferences.

Abstract

Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, its optimization properties, particularly the impact of samplers on its convergence rate, remain under-explored. In this paper, we provide a rigorous analysis of DPO's convergence rates with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves linear convergence, while our proposed online sampler achieves quadratic convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and logit mixing, demonstrating improvements over previous methods. For example, it outperforms vanilla DPO by over 7.4% on the Safe-RLHF dataset. Our results not only offer insights into the theoretical understanding of DPO but also pave the way for further algorithm designs.
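
To make the linear versus quadratic distinction concrete, the textbook definitions of the two rates can be written as follows. This is a generic sketch, not the paper's theorem statements: the error measure $e_t$ and the constants $c$ and $C$ are placeholder symbols assumed here for illustration, and the paper's bounds may use a different error quantity and additional conditions.

```latex
% Linear convergence: the error shrinks by a constant factor per iteration.
\[
  e_{t+1} \le c\, e_t, \quad 0 < c < 1
  \qquad\Longrightarrow\qquad
  e_T \le c^{T} e_0 .
\]
% Quadratic convergence: the error is squared (up to a constant) per iteration.
\[
  e_{t+1} \le C\, e_t^{2}
  \qquad\Longrightarrow\qquad
  C\, e_T \le \left( C\, e_0 \right)^{2^{T}} ,
\]
% which decays doubly exponentially in T once C e_0 < 1.
```

Under these generic definitions, reaching error $\varepsilon$ takes on the order of $\log(1/\varepsilon)$ iterations at a linear rate but only on the order of $\log\log(1/\varepsilon)$ iterations at a quadratic rate.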

Paper Structure (56 sections, 8 theorems, 93 equations, 6 figures, 5 tables)

This paper contains 56 sections, 8 theorems, 93 equations, 6 figures, 5 tables.

Introduction
Related Work
Theoretical study of RLHF/DPO.
Variants of DPO.
Other RLHF approaches.
Convergence analysis of policy gradient methods.
Preliminaries
Standard Bandit Learning
Multi-armed bandits and contextual bandits.
Policies.
Reinforcement Learning from Human Feedback (RLHF)
Bradley-Terry (BT) model.
RLHF (Ziegler2019FineTuningLM, Bai2022TrainingAH).
Direct Preference Optimization (DPO, rafailov2023direct).
Main Results
...and 41 more sections

Key Result

Theorem 1

Under the theorem's condition, DPO-Unif satisfies an upper bound corresponding to linear convergence, stated in terms of $T\in\mathbb N$, the number of iterations.
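
The setting of this result can be sketched numerically. The block below is a minimal illustration, not the authors' code: it runs exact-gradient DPO with a uniform sampler over response pairs on a single-context bandit and tracks a total parameter difference. Several pieces are assumptions made for the sketch: the tabular softmax parameterization (one logit per response), a zero reference policy, Bradley-Terry preferences generated from a known reward vector `r`, the form of the gap $\delta(y, y'; \theta) = \beta\,[(\theta_y - \theta_{y'}) - (\theta^{\mathrm{ref}}_y - \theta^{\mathrm{ref}}_{y'})] - (r_y - r_{y'})$, and the names `exact_dpo_uniform` and `total_gap`, which are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))


def implicit_margin(theta, theta_ref, beta):
    """Pairwise DPO margin h[y, y'] = beta * ((theta_y - theta_y') - (ref_y - ref_y'))."""
    diff = theta[:, None] - theta[None, :]
    diff_ref = theta_ref[:, None] - theta_ref[None, :]
    return beta * (diff - diff_ref)


def total_gap(theta, theta_ref, r, beta):
    """Assumed metric: sum over all pairs of |h[y, y'] - (r_y - r_y')|."""
    h = implicit_margin(theta, theta_ref, beta)
    return np.abs(h - (r[:, None] - r[None, :])).sum()


def exact_dpo_uniform(r, beta=1.0, lr=1.0, steps=500):
    """Gradient descent on the expected DPO loss with uniform weights on pairs."""
    k = len(r)
    theta_ref = np.zeros(k)          # assumption: reference logits are zero
    theta = theta_ref.copy()         # start at the reference policy
    # Bradley-Terry preference probabilities P(y beats y').
    p_star = sigmoid(r[:, None] - r[None, :])
    gaps = [total_gap(theta, theta_ref, r, beta)]
    for _ in range(steps):
        h = implicit_margin(theta, theta_ref, beta)
        # Derivative of the per-pair cross-entropy loss w.r.t. h is sigmoid(h) - p_star.
        resid = sigmoid(h) - p_star
        # Each logit appears as first and second element of a pair; by
        # antisymmetry of resid the two contributions are equal, hence the factor 2.
        grad = 2.0 * beta * resid.sum(axis=1) / k**2
        theta = theta - lr * grad
        gaps.append(total_gap(theta, theta_ref, r, beta))
    return theta, gaps


if __name__ == "__main__":
    rewards = np.array([1.0, 0.0, -0.5])
    theta, gaps = exact_dpo_uniform(rewards)
    print("initial gap:", gaps[0])
    print("final gap:  ", gaps[-1])
```

Plotting `gaps` on a logarithmic axis should give a roughly straight line until floating-point precision is reached, which is the signature of a linear rate. The mixed samplers are not implemented here because their definitions are not reproduced in this summary.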

Figures (6)

Figure 1: Contextual bandit experiments for exact DPO and empirical DPO. The $x$-axis is the number of gradient updates, and the $y$-axis is the total parameter difference $\sum_{y, y'} \vert\delta (y, y'; \theta^{(t)})\vert$. The left figure illustrates exact DPO, and the right figure illustrates empirical DPO. For exact DPO, the lower bound is due to floating-point precision. For empirical DPO, the lower bound is due to sampling variance. The separation is clear in exact DPO and still exists in empirical DPO.
Figure 2: The Reward-KL curves. The left figure illustrates results on Safe-RLHF, and the right one illustrates results on Iterative-Prompt. The KL-divergence is measured on a subset of prompts in the test set. The results indicate that the KL-divergence of trained models does not deviate much from the reference model, and our method performs best in balancing reward and KL-divergence.
Figure 3: More bandit experiments for exact DPO.
Figure 4: More bandit experiments for empirical DPO.
Figure 5: Ablation on components of mixed samplers for DPO-Mix-R.
...and 1 more figure

Theorems & Definitions (19)

Definition 1: Exact DPO
Remark 1
Definition 2: Empirical DPO
Remark 2
Definition 3: Uniform sampler
Definition 4: Reward-guided mixed sampler
Remark 3
Definition 5: Policy-difference-guided mixed sampler
Remark 4
Theorem 1: Upper bound of DPO-Unif
...and 9 more
