Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective

Jiawei Huang; Bingcong Li; Christoph Dann; Niao He

TL;DR

This work addresses sample efficiency in online RLHF by exploiting imperfect yet informative source reward models under a KL-regularized objective. It develops Transfer Policy Optimization (TPO), a theory-backed framework that combines online learning with transfer from source reward models and a self-transfer mechanism, guided by policy-value and policy-coverage principles. The authors show that, under KL regularization, a policy’s ability to cover the optimal policy is controlled by its value gap (sub-optimality), and they prove a $\widetilde{O}(T^{-1/2})$ convergence rate for the self-transfer policy distilled from online data. They further propose an empirical variant that selects transfer candidates by estimated win rate, which improves computational efficiency. Experiments on summarization tasks with T5 and several reward models show improved sample efficiency and robustness, and the approach is modular and compatible with multiple policy-optimization methods.
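For orientation, the KL-regularized objective referred to above is usually written in the following standard form. The notation here ($\rho$ for the prompt distribution, $r^*$ for the true reward, $\pi_{\mathrm{ref}}$ for the reference policy, $\beta > 0$ for the regularization weight) is assumed for illustration and may differ from the paper's own symbols.

```latex
\[
J_\beta(\pi)
  = \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x)}\bigl[ r^*(x, y) \bigr]
  - \beta \, \mathbb{E}_{x \sim \rho}\Bigl[ \mathrm{KL}\bigl( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr) \Bigr],
\qquad
\pi^*_\beta(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\bigl( r^*(x, y) / \beta \bigr).
\]
```

The closed-form maximizer on the right is the well-known solution of this objective. Because it is a reweighting of $\pi_{\mathrm{ref}}$, the KL term keeps the optimal policy close to the reference policy, which is the setting in which the coverage property is stated.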

Abstract

Sample efficiency is critical for online Reinforcement Learning from Human Feedback (RLHF). While existing works investigate sample-efficient online exploration strategies, the potential of utilizing misspecified yet relevant reward models to accelerate learning remains underexplored. This paper studies how to transfer knowledge from those imperfect reward models in online RLHF. We start by identifying a novel property due to KL-regularization in the RLHF objective: a policy's coverability of the optimal policy is captured by its sub-optimality. Building on this insight, we propose novel transfer learning principles and a theoretical algorithm, Transfer Policy Optimization (TPO), with provable benefits compared to standard online learning. Empirically, inspired by our theoretical findings, we develop a win-rate-based transfer policy selection strategy with improved computational efficiency. Moreover, our empirical transfer learning technique is modular and can be integrated with various policy optimization methods, such as DPO, IPO and XPO, to further enhance their performance. We validate the effectiveness of our method through experiments on summarization tasks.
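The win-rate-based selection idea can be illustrated with a minimal sketch. This is not the authors' procedure: the function names, the fallback label `"online"`, the 0.5 threshold, and the toy comparison data are all hypothetical, and the sketch assumes each candidate policy has already been compared against the current online policy, with 1 recorded when the candidate's response was preferred.

```python
from typing import Dict, Sequence


def estimate_win_rate(outcomes: Sequence[int]) -> float:
    """Fraction of comparisons won; each outcome is 1 (candidate preferred) or 0."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


def select_transfer_policy(
    comparisons: Dict[str, Sequence[int]],
    fallback: str = "online",   # hypothetical label for the current online policy
    threshold: float = 0.5,     # assumed cutoff: a candidate must win more than half
) -> str:
    """Return the candidate with the highest estimated win rate against the
    online policy, or the fallback if no candidate exceeds the threshold."""
    best_name, best_rate = fallback, threshold
    for name, outcomes in comparisons.items():
        rate = estimate_win_rate(outcomes)
        if rate > best_rate:
            best_name, best_rate = name, rate
    return best_name


# Toy data (made up for illustration): preference outcomes per candidate.
comparisons = {
    "source_rm_a": [1, 1, 0, 1],  # wins 3 of 4 comparisons
    "source_rm_b": [0, 1, 0, 0],  # wins 1 of 4 comparisons
}
selected = select_transfer_policy(comparisons)
```

With this toy data the first candidate is chosen. When no candidate wins more than half of its comparisons, the sketch keeps the online policy, which reflects the intuition that transfer should only be used when it is estimated to help.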
