On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization

Yong Lin, Skyler Seto, Maartje ter Hoeve, Katherine Metcalf, Barry-John Theobald, Xuan Wang, Yizhe Zhang, Chen Huang, Tong Zhang

TL;DR

The paper systematically contrasts explicit reward models (EXRM) trained as in RLHF with DPORM, the implicit reward model induced by Direct Preference Optimization, to evaluate how well each generalizes under distribution shifts. Across multiple datasets, tasks, and model sizes, EXRM generally outperforms DPORM on out-of-distribution (OOD) data, even when in-distribution (ID) performance is similar, indicating that DPORM generalizes less well. Iterative DPO experiments further show that incorporating EXRM yields more robust alignment than relying on DPORM alone. The findings argue for including explicit reward modeling in iterative DPO workflows to achieve stronger and more reliable LLM alignment under distribution shifts.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an EXplicit Reward Model (EXRM) as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown that the implicit reward model of DPO (denoted as DPORM) can approximate an EXRM in the limit. DPORM's effectiveness directly implies the optimality of the learned policy, and also has practical implications for LLM alignment methods, including iterative DPO. However, it is unclear how well DPORM empirically matches the performance of EXRM. This work studies the accuracy of both DPORM and EXRM at distinguishing preferred from rejected answers. Our findings indicate that even though DPORM fits the training dataset comparably, it generalizes less effectively than EXRM, especially when the validation datasets contain distribution shifts. Across five out-of-distribution settings, DPORM has a mean drop in accuracy of 3% and a maximum drop of 7%. These findings highlight that DPORM has limited generalization ability and substantiate the integration of an explicit reward model in iterative DPO approaches.
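
As a rough illustration of the metric described above, the sketch below scores preference pairs with an implicit DPO reward and reports the fraction of pairs ranked correctly. It assumes the standard DPO formulation, r(x, y) = β [log π_θ(y|x) − log π_ref(y|x)], and drops the prompt-only normalizing term because it cancels when two responses to the same prompt are compared. The function names, the value of β, and the toy log-probabilities are hypothetical placeholders, not code or numbers from the paper.

```python
from typing import Sequence, Tuple

# (log-prob under the DPO policy, log-prob under the reference model),
# each summed over the response tokens. Hypothetical input format.
LogProbs = Tuple[float, float]


def dpo_implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Implicit DPO reward, up to a constant that depends only on the prompt."""
    return beta * (logp_policy - logp_ref)


def preference_accuracy(
    pairs: Sequence[Tuple[LogProbs, LogProbs]], beta: float = 0.1
) -> float:
    """Fraction of (chosen, rejected) pairs where the chosen response scores higher."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = 0
    for chosen, rejected in pairs:
        r_chosen = dpo_implicit_reward(*chosen, beta=beta)
        r_rejected = dpo_implicit_reward(*rejected, beta=beta)
        if r_chosen > r_rejected:
            correct += 1
    return correct / len(pairs)


# Toy data (made up): the first pair is ranked correctly, the second is not.
toy_pairs = [
    ((-10.0, -12.0), (-15.0, -14.0)),
    ((-20.0, -18.0), (-9.0, -10.0)),
]
accuracy = preference_accuracy(toy_pairs)
print(f"pairwise accuracy: {accuracy:.2f}")
```

An explicit reward model would be evaluated the same way, with `dpo_implicit_reward` swapped for the scalar output of the trained reward model; comparing the two accuracies on in-distribution and shifted validation sets is the kind of comparison the abstract describes.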

Paper Structure

This paper contains 18 sections, 5 equations, 3 figures, 8 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of methods for learning reward models explicitly and implicitly (via DPO). Figure adapted from rafailov2024direct.
  • Figure 2: Examples of different types of distributional shifts for reward models and accuracy drops on real-world datasets.
  • Figure 3: (a) Aggregated mean ID and OOD accuracy across experiments in Setting I, a mixture of all distribution shifts in Table \ref{tab:settings}. (b) The proportion of experiments in which EXRM outperforms DPORM in Setting I, with three models and three seeds. (c) Results on specific types of distributional shift (Setting II in Table \ref{tab:settings}). (c-Top) Response shift, evaluated on UltraFeedback (ID) and our annotated dataset based on generations from LLaMA3-8B (OOD). (c-Bottom) Prompt shift, evaluated on TL;DR summarization (ID) and on CNN and DailyMail (OOD).