Learning a Pessimistic Reward Model in RLHF

Yinglun Xu, Hangoo Kang, Tarun Suresh, Yuxuan Wan, Gagandeep Singh

TL;DR

The paper addresses reward hacking in offline RLHF, where a proxy reward model $r$ can overestimate the value of some outputs. It introduces PET, a pessimistic reward fine-tuning method that adversarially trains the reward model against a rejection-sampling-based policy, removing the need for KL regularization in policy optimization. The authors propose a three-step RLHF framework: (1) train a proxy reward by minimizing the dataset loss $\mathcal{L}_{\mathcal{D}}(r)$; (2) apply PET to obtain a pessimistic reward by solving the minimax objective $\min_{r} \left[ V^\mu_r(\pi_{RS}) - V^\mu_r(\pi_{ref}) + \beta \mathcal{L}_{\mathcal{D}}(r) \right]$; and (3) optimize a policy using rejection sampling on the pessimistic reward. Theoretical guarantees link dataset coverage to performance, and experiments on TL;DR and IMDB show that PET-based policies achieve competitive or superior results compared to state-of-the-art RLHF methods, even with high KL divergence from the dataset. The approach demonstrates that a pessimistic reward model can guide greedy policy search without regularization, mitigating reward hacking in offline settings.
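To make the PET objective above concrete, the following is a minimal sketch that evaluates $V^\mu_r(\pi_{RS}) - V^\mu_r(\pi_{ref}) + \beta \mathcal{L}_{\mathcal{D}}(r)$ on a toy problem. It is not the authors' implementation and rests on several assumptions that this summary does not confirm: a single prompt with a small finite response set, a tabular reward, $\pi_{RS}$ taken to be best-of-$n$ sampling from a base policy, and $\mathcal{L}_{\mathcal{D}}$ taken to be a Bradley-Terry negative log-likelihood. All function and variable names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Policy = Dict[str, float]   # response -> probability (single prompt, assumed)
Reward = Dict[str, float]   # response -> scalar reward (tabular, assumed)


def expected_reward(policy: Policy, reward: Reward) -> float:
    """V_r(pi): expected reward of a response drawn from `policy`."""
    return sum(p * reward[y] for y, p in policy.items())


def best_of_n_value(base: Policy, reward: Reward, n: int) -> float:
    """Exact expected reward of the best of n i.i.d. samples from `base`.

    Assumes rejection sampling means keeping the highest-reward sample.
    Uses P(max <= r_k) = CDF(r_k) ** n over responses sorted by reward.
    """
    items = sorted(base.items(), key=lambda kv: reward[kv[0]])
    value, cdf_prev = 0.0, 0.0
    for response, prob in items:
        cdf = cdf_prev + prob
        value += (cdf ** n - cdf_prev ** n) * reward[response]
        cdf_prev = cdf
    return value


def preference_loss(reward: Reward, prefs: List[Tuple[str, str]]) -> float:
    """Assumed dataset loss: mean Bradley-Terry NLL over (chosen, rejected) pairs."""
    return sum(
        math.log1p(math.exp(-(reward[chosen] - reward[rejected])))
        for chosen, rejected in prefs
    ) / len(prefs)


def pet_objective(reward: Reward, base: Policy, ref: Policy,
                  prefs: List[Tuple[str, str]], n: int, beta: float) -> float:
    """Value gap of the rejection-sampling policy over pi_ref, plus beta * loss."""
    gap = best_of_n_value(base, reward, n) - expected_reward(ref, reward)
    return gap + beta * preference_loss(reward, prefs)


# Toy data: response "c" never appears in the preference data.
base = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
ref = {"a": 1.0}
prefs = [("a", "b")]
proxy = {"a": 1.0, "b": 0.0, "c": 3.0}        # overestimates the unseen "c"
pessimistic = {"a": 1.0, "b": 0.0, "c": 0.0}  # same fit on the data, lower on "c"

print(pet_objective(proxy, base, ref, prefs, n=4, beta=1.0))
print(pet_objective(pessimistic, base, ref, prefs, n=4, beta=1.0))
```

In this toy setup both reward tables fit the single preference pair equally well, so the loss term is identical; the objective is lower for the table that does not inflate the response absent from the data, which is the direction the minimization pushes.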

Abstract

This work proposes "PET", a novel pessimistic reward fine-tuning method, to learn a pessimistic reward model robust against reward hacking in offline reinforcement learning from human feedback (RLHF). Traditional reward modeling techniques in RLHF train an imperfect reward model, on which KL regularization plays a pivotal role in mitigating reward hacking when optimizing a policy. Such an intuition-based method still suffers from reward hacking, and policies with large KL divergence from the dataset distribution are excluded during learning. In contrast, we show that when optimizing a policy on a pessimistic reward model fine-tuned through PET, reward hacking can be prevented without relying on any regularization. We test our methods on the standard TL;DR summarization dataset. We find that one can learn a high-quality policy on our pessimistic reward without using any regularization. Such a policy has a high KL divergence from the dataset distribution while having high performance in practice. In summary, our work shows the feasibility of learning a pessimistic reward model against reward hacking. The agent can greedily search for the policy with a high pessimistic reward without suffering from reward hacking.

Paper Structure

This paper contains 22 sections, 2 theorems, 16 equations, 1 figure, 7 tables, 4 algorithms.

Key Result

Proposition 2.1

For any prompt distribution $\mu$, base policy $\pi_0$, number of samples $n$, and reward model $r_0$, the rejection sampling policies satisfy:

Figures (1)

  • Figure 1: A three-step reward-based learning framework. The first step is traditional reward modeling, which trains a reward model with minimal loss on predicting the dataset preferences. The second step fine-tunes the learned reward model to make it pessimistic. In particular, the reward model is adversarially trained against a policy model induced by the rejection sampling process, while still being required to induce minimal prediction loss on the dataset. In the last step, the framework optimizes a policy on the pessimistic reward and outputs the learned policy.
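The policy "induced by the rejection sampling process" in Figure 1 can be pictured with a short sketch. This is an illustration only, assuming that rejection sampling here means best-of-$n$ selection: draw $n$ candidate responses from a base policy and keep the one the reward model scores highest. The sampler, the reward function, and every name below are hypothetical stand-ins, not the paper's models.

```python
import random
from typing import Callable, List

SampleFn = Callable[[str, random.Random], str]   # (prompt, rng) -> response
RewardFn = Callable[[str, str], float]           # (prompt, response) -> score


def best_of_n(prompt: str, sample_response: SampleFn, reward: RewardFn,
              n: int, rng: random.Random) -> str:
    """Draw n candidates from the base policy and return the top-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [sample_response(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))


# Hypothetical toy stand-ins for a base policy and a reward model.
RESPONSES = ["ok", "a longer summary", "short", "the most detailed summary here"]


def toy_base_policy(prompt: str, rng: random.Random) -> str:
    """Uniformly random response; a real base policy would be a language model."""
    return rng.choice(RESPONSES)


def toy_reward(prompt: str, response: str) -> float:
    """Length-based score; a real reward would come from the learned model."""
    return float(len(response))


if __name__ == "__main__":
    rng = random.Random(0)
    print(best_of_n("Summarize the post.", toy_base_policy, toy_reward, 8, rng))
```

With $n = 1$ the procedure reduces to sampling from the base policy; larger $n$ searches more greedily on the reward, which is why the quality of that reward (proxy versus pessimistic) matters in the last step of the framework.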

Theorems & Definitions (8)

  • Proposition 2.1
  • Remark 3.1
  • Definition 3.2
  • Theorem 3.3
  • Remark 3.4
  • Remark A.1
  • proof
  • proof