Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective

Jiawei Huang, Bingcong Li, Christoph Dann, Niao He

TL;DR

This work addresses sample efficiency in RLHF by exploiting imperfect yet informative source reward models under KL-regularized objectives. It develops Transfer Policy Optimization (TPO), a theory-backed framework that combines online learning with transfer from source rewards and a self-transfer mechanism, guided by policy-value and policy-coverage principles. The authors establish that KL regularization yields a coverability bound linking a policy’s value gap to its ability to cover the optimal policy, and they prove a $\widetilde{O}(T^{-1/2})$ convergence rate for the self-transfer policy distilled from online data. They further propose an efficient, empirical variant that uses win-rate estimates to select transfer candidates, achieving improved performance and computational efficiency. Empirical validation on summarization tasks with T5 and various reward models demonstrates tangible improvements in sample efficiency and robustness, with the approach being modular and compatible with multiple policy-optimization methods.

Abstract

Sample efficiency is critical for online Reinforcement Learning from Human Feedback (RLHF). While existing works investigate sample-efficient online exploration strategies, the potential of utilizing misspecified yet relevant reward models to accelerate learning remains underexplored. This paper studies how to transfer knowledge from those imperfect reward models in online RLHF. We start by identifying a novel property due to KL-regularization in the RLHF objective: a policy's coverability of the optimal policy is captured by its sub-optimality. Building on this insight, we propose novel transfer learning principles and a theoretical algorithm, Transfer Policy Optimization (TPO), with provable benefits compared to standard online learning. Empirically, inspired by our theoretical findings, we develop a win-rate-based transfer policy selection strategy with improved computational efficiency. Moreover, our empirical transfer learning technique is modular and can be integrated with various policy optimization methods, such as DPO, IPO and XPO, to further enhance their performance. We validate the effectiveness of our method through experiments on summarization tasks.
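
For reference, the KL-regularized objective that the abstract refers to is usually written as below. This is the standard form from the RLHF literature rather than a formula quoted from this page: the prompt distribution $\rho$, the reference policy $\pi_{\text{ref}}$ and the regularization weight $\beta$ are notational assumptions, and only $J_\beta$ and $r^*$ appear elsewhere on this page (in Lemma 2.3).

$$
J_\beta(\pi) = \mathbb{E}_{x\sim\rho,\; y\sim\pi(\cdot\mid x)}\big[r^*(x,y)\big] - \beta\,\mathbb{E}_{x\sim\rho}\Big[\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big)\Big]
$$

$$
\pi^*(y\mid x) \;\propto\; \pi_{\text{ref}}(y\mid x)\,\exp\!\big(r^*(x,y)/\beta\big)
$$

The second line is the well-known closed-form maximizer of $J_\beta$. It shows that the regularized optimum is a reweighting of the reference policy, which is the kind of structure a coverage argument can exploit.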

Paper Structure

This paper contains 54 sections, 33 theorems, 115 equations, 2 figures, 2 tables.

Key Result

Lemma 2.3

Given a dataset $\mathcal{D}$ generated by a policy $\pi^\mathcal{D}$, running ${\text{RPO}}$ with any $\mathcal{R}$ including $r^*$ and $\widetilde{\Pi}$ yields $\widehat{\pi}$, s.t., $\forall \pi\in\widetilde{\Pi},J_\beta(\pi) - J_\beta(\widehat{\pi}) = \widetilde{O}( e^{2{\Rmax}} {\texttt{Cov}}^{

Figures (2)

  • Figure 1: The standard online RLHF pipeline learns only from online human feedback (left). Our setting additionally leverages available imperfect reward models via transfer learning (right). Inspired by the structure induced by KL regularization, we propose novel principles for transfer learning in online RLHF: (1) selecting the transfer policy $\pi_\text{Transfer}$ with the highest policy value; (2) self-transfer learning, which includes as a candidate the policy $\pi_{\text{Dstl}}$ distilled from online-collected data by offline learning techniques.
  • Figure 2: A closer look at the source reward model selection process. We report the allocation of transfer budget to each source task, averaged over 3 trials (top), and the win rates ${\mathbb{P}}_{r^*}(\cdot\succ\pi^k_\texttt{OL})$ (bottom) for iterations $k=1,2,3$. Due to space limits, we abbreviate the source task names: R, B, TB, TL and NT stand for ROUGE-Lsum, BERTScore, T5-Base, T5-Large and No Transfer, respectively.
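
To make the win-rate-based selection idea concrete, here is a minimal sketch of how a transfer candidate might be picked from pairwise comparison outcomes against the current online policy. It is an illustration under stated assumptions, not the authors' procedure: the function names, the 0/1 outcome encoding, the 0.5 threshold and the fallback to "NT" (no transfer) are all hypothetical choices made for this example, and the numbers are made up.

```python
from typing import Dict, Sequence


def empirical_win_rate(outcomes: Sequence[int]) -> float:
    """Fraction of comparisons the candidate won against the online policy.

    Each outcome is 1 if the candidate's response was preferred, else 0.
    """
    if not outcomes:
        raise ValueError("need at least one comparison outcome")
    return sum(outcomes) / len(outcomes)


def select_transfer_candidate(
    outcomes_by_candidate: Dict[str, Sequence[int]],
    threshold: float = 0.5,   # assumption: transfer only if the candidate beats the online policy on average
    fallback: str = "NT",     # assumption: "NT" denotes no transfer, as in the Figure 2 abbreviations
) -> str:
    """Return the candidate with the highest estimated win rate, or the fallback."""
    win_rates = {
        name: empirical_win_rate(outcomes)
        for name, outcomes in outcomes_by_candidate.items()
    }
    best = max(win_rates, key=win_rates.get)
    return best if win_rates[best] > threshold else fallback


# Hypothetical comparison outcomes for three source-reward policies.
example = {"R": [1, 0, 0, 1], "TL": [1, 1, 1, 0], "TB": [0, 0, 1, 0]}
chosen = select_transfer_candidate(example)
```

With these made-up outcomes the estimated win rates are 0.5, 0.75 and 0.25, so the sketch selects "TL". A real implementation would also need to account for estimation noise when only a few comparisons are available.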

Theorems & Definitions (60)

  • Definition 2.2
  • Lemma 2.3: Offline RLHF; Thm. 5.3 in liu2024provably; Informal
  • Lemma 2.4: Online RLHF; Thm. 3.1 in xie2024exploratory; Informal
  • Lemma 3.0
  • Theorem 3.1
  • Corollary 4.1
  • Lemma 4.1: Value Est Error for $\{\pi^*_{r^w}\}_{w\in[W]}$
  • Lemma 4.1: Value Est Error for $\pi_\SELF$
  • Theorem 4.2: Total Regret
  • Corollary 4.3
  • ...and 50 more