Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

Ruizhe Shi, Minhak Song, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du

TL;DR

This work addresses the performance gap between RLHF and DPO under representation gaps between reward and policy classes. It develops a fine-grained theory for exact optimization and for finite-sample settings, introducing a taxonomy of mis-specification scenarios and the PILAF online sampler, plus token-level insights that reveal how reward and policy errors interact. A concrete DTSP construction demonstrates a data-efficiency advantage for RLHF in sparse-reward settings, with formal rates distinguishing reward-learning versus surrogate-learning regimes. Empirical verifications on PKU-SafeRLHF corroborate the theoretical predictions and illustrate practical guidance on when to favor RLHF or DPO depending on model capacity and data availability. Overall, the paper provides a nuanced framework linking representational capacity, sampling efficiency, and optimization dynamics to the RLHF-versus-DPO choice in preference-based policy learning.

Abstract

We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the quality of the final policy. We show that RLHF, DPO, or online DPO can each outperform the others depending on the type of model mis-specification. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model, highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.
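To make the contrast between two-stage learning and direct optimization concrete, here is a minimal sketch of the two per-pair losses as they are commonly written in the literature. It is not taken from the paper: the function names, the scalar inputs, and the default `beta=0.1` are illustrative assumptions. RLHF's first stage fits an explicit reward with a Bradley-Terry likelihood, while DPO applies the same likelihood to the surrogate reward `beta * log(pi / pi_ref)`.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bt_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    This is the standard reward-learning loss used in RLHF's first stage;
    the inputs are scores from an explicit reward model (hypothetical here).
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed regularization strength, for illustration only
) -> float:
    """Standard DPO loss for one preference pair.

    The policy's log-ratio against the reference policy plays the role of
    a surrogate reward, which is then scored with the same Bradley-Terry loss.
    """
    surrogate_chosen = beta * (logp_chosen - ref_logp_chosen)
    surrogate_rejected = beta * (logp_rejected - ref_logp_rejected)
    return bt_reward_loss(surrogate_chosen, surrogate_rejected)


if __name__ == "__main__":
    # Explicit reward model that already prefers the chosen response.
    print(bt_reward_loss(r_chosen=1.5, r_rejected=0.5))
    # Policy identical to the reference: the surrogate margin is zero.
    print(dpo_loss(-2.0, -3.0, -2.0, -3.0))
```

In this sketch the only difference between the two losses is what is parameterized: a reward model in the first case, the policy itself in the second. That is why the relative capacity of the reward and policy classes matters for the comparison the abstract describes.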

Paper Structure

This paper contains 30 sections, 16 theorems, 129 equations, 5 figures.

Key Result

Proposition 1

Under \ref{condition:ss}, $V_{r^\star}^{\pi_\textup{RLHF}} = V_{r^\star}^{\pi_\textup{DPO}} = V_\Pi^\star$.

Figures (5)

  • Figure 1: Main results on performance gap induced by model mis-specification scenarios.
  • Figure 2: Experimental Results for \ref{condition:ss}. Experiments with different reward scales $\{0.4, 1, 4\}$ align with \ref{thm:approximation}: as the reward scale increases, the second-order deviation in the online DPO objective grows, giving RLHF a clear advantage.
  • Figure 3: Experimental Results for \ref{condition:sw}, \ref{condition:ws}, and \ref{condition:ww}. The first two plots (\ref{condition:sw} and \ref{condition:ws}) are consistent with \ref{prop:sw} and \ref{prop:ws}. The gap in the last plot can be attributed to the mis-specified reward model being too weak.
  • Figure 4: Experimental Results on Statistical Efficiency. We experiment on two preference types. Pure reward learning is shown to be more data-efficient than surrogate reward learning.
  • Figure 5: Numerically Computed Curves of Gradient Functions and Value Functions.

Theorems & Definitions (26)

  • Proposition 1
  • Definition 1: PILAF Sampler \cite{shi2025the, feng2025pilaf}
  • Remark 1
  • Theorem 2
  • Remark 2
  • Proposition 3
  • Proposition 4
  • Remark 3
  • Proposition 5
  • Remark 4
  • ...and 16 more