Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization

Juntao Dai, Taiye Chen, Yaodong Yang, Qian Zheng, Gang Pan

TL;DR

Reward over-optimization in RLHF arises from reward-model extrapolation on OOD responses during RL. The paper introduces BSPO, which grounds policy evaluation in the in-distribution region by using a behavior policy derived from the next-token distribution and a behavior-supported Bellman operator to penalize OOD values while preserving ID values, with theoretical contraction and monotonic improvement guarantees. Empirically, BSPO outperforms baselines by reducing OOD generation and reliably finding optimal ID policies across model scales and datasets, even under label noise. This approach provides a principled, data-efficient route to stable RLHF that aligns policies with true human objectives while mitigating extrapolation-induced over-optimization.

Abstract

Reinforcement learning from human feedback (RLHF) is an effective method for aligning large language models (LLMs) with human values. However, reward over-optimization remains an open challenge, leading to discrepancies between the performance of LLMs under the reward model and the true human objectives. A primary contributor to reward over-optimization is the extrapolation error that arises when the reward model evaluates out-of-distribution (OOD) responses. Current methods still fail to prevent the increasing frequency of OOD response generation during the reinforcement learning (RL) process and are not effective at handling extrapolation errors from OOD responses. In this work, we propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue. Specifically, we define the behavior policy as the next-token distribution of the reward training dataset to model the in-distribution (ID) region of the reward model. Building on this, we introduce the behavior-supported Bellman operator to regularize the value function, penalizing all OOD values without impacting the ID ones. Consequently, BSPO reduces the generation of OOD responses during the RL process, thereby avoiding overestimation caused by the reward model's extrapolation errors. Theoretically, we prove that BSPO guarantees a monotonic improvement of the supported policy until convergence to the optimal behavior-supported policy. Empirical results from extensive experiments show that BSPO outperforms baselines in preventing reward over-optimization due to OOD evaluation and finding the optimal ID policy.
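The idea of flagging behavior-unsupported tokens and penalizing only their values can be illustrated with a small sketch. This is not the authors' implementation: the function names, the threshold `eps`, and the penalty value `q_min` are hypothetical choices made for illustration, and the behavior policy is assumed to be available as a per-step next-token probability table.

```python
import numpy as np

def unsupported_mask(behavior_probs, token_ids, eps=0.0):
    """Flag tokens the behavior policy does not support.

    behavior_probs: (T, V) array, row t is the assumed beta(. | s_t).
    token_ids:      (T,) array of tokens actually generated.
    eps:            hypothetical support threshold (eps=0 means beta(a|s) = 0).
    """
    probs = np.asarray(behavior_probs, dtype=float)
    ids = np.asarray(token_ids, dtype=int)
    chosen = probs[np.arange(len(ids)), ids]  # beta(a_t | s_t) for each step
    return chosen <= eps

def regularized_values(values, mask, q_min):
    """Replace values of unsupported tokens by q_min; leave supported ones as-is."""
    return np.where(mask, q_min, np.asarray(values, dtype=float))

# Toy response of three tokens over a vocabulary of size three.
behavior_probs = [[0.7, 0.3, 0.0],
                  [0.0, 0.5, 0.5],
                  [1.0, 0.0, 0.0]]
token_ids = [0, 0, 0]
mask = unsupported_mask(behavior_probs, token_ids)
penalized = regularized_values([1.0, 5.0, 2.0], mask, q_min=-10.0)
num_unsupported = int(mask.sum())  # per-response count of unsupported tokens
```

In this toy case only the second token has zero behavior probability, so only its value is replaced by the penalty, while the in-distribution values are untouched.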

Paper Structure

This paper contains 52 sections, 15 theorems, 60 equations, 14 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

For any policy $\pi \in \Pi$, the operator $\mathcal{T}^\pi_\beta$ is a $\gamma$-contraction with respect to the $\mathcal{L}_\infty$ norm over the space $\mathcal{S}\times\mathcal{A}$.
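The contraction property can be checked numerically on a small tabular problem. The sketch below assumes a piecewise form of the operator that this summary does not spell out: a standard Bellman backup on supported pairs ($\beta(a|s) > 0$) and a constant `q_min` on unsupported ones. The tabular MDP, the random support mask, and all names are hypothetical.

```python
import numpy as np

def behavior_supported_backup(Q, R, P, pi, supported, gamma, q_min):
    """One application of the assumed behavior-supported operator.

    Q, R:      (S, A) value and reward tables.
    P:         (S, A, S) transition probabilities.
    pi:        (S, A) policy probabilities.
    supported: (S, A) boolean mask, True where beta(a|s) > 0.
    """
    V = (pi * Q).sum(axis=1)        # expected next-state value under pi, shape (S,)
    standard = R + gamma * (P @ V)  # ordinary Bellman backup, shape (S, A)
    return np.where(supported, standard, q_min)

rng = np.random.default_rng(0)
S, A, gamma, q_min = 4, 3, 0.9, -5.0
R = rng.normal(size=(S, A))
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)    # normalize into transition distributions
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)   # normalize into a policy
supported = rng.random((S, A)) > 0.3  # hypothetical support mask

Q1 = rng.normal(size=(S, A))
Q2 = rng.normal(size=(S, A))
TQ1 = behavior_supported_backup(Q1, R, P, pi, supported, gamma, q_min)
TQ2 = behavior_supported_backup(Q2, R, P, pi, supported, gamma, q_min)

lhs = np.abs(TQ1 - TQ2).max()        # ||T Q1 - T Q2||_inf
rhs = gamma * np.abs(Q1 - Q2).max()  # gamma * ||Q1 - Q2||_inf
```

Under this assumed form, unsupported entries contribute zero to the difference and supported entries inherit the usual $\gamma$ bound of the standard backup, so `lhs <= rhs` holds for any pair of tables.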

Figures (14)

  • Figure 1: (a) Reward over-optimization. Although the performance of LLMs may seem to improve under the reward model (proxy reward), it deviates from the actual human objectives (gold reward). (b) Search in ID region. Our algorithm guides policy iteration within the ID region of the reward model, whereas others may enter the OOD region, suffering from extrapolation errors. (c) Hard to evaluate unsupported responses. Responses are categorized as supported or unsupported, depending on whether they include actions unsupported by the behavior policy ($\beta(a|s) = 0$). As the policy iterates, the occurrence of unsupported responses increases. "Correct/Incorrect" indicates whether the proxy model's evaluation of a generated response aligns with the gold model. The proxy model predicts preference pairs well for supported responses but struggles with unsupported ones.
  • Figure 2: (a) Structure of our ScoreLM model. We retain the original language model head to predict the next-token distribution and initialize a score head to predict the reward. (b) Compare with Standard RM. The performance of ScoreLM is comparable to standard reward models under three scales on the test set. The short vertical lines indicate the standard deviation of four repetitions.
  • Figure 3: Main results. The training curves of various algorithms across three experimental settings show upward trends in proxy rewards. Most baselines suffer from reward over-optimization. In contrast, our BSPO algorithm effectively mitigates this issue and achieves the highest gold reward.
  • Figure 4: (a) Win Rates. The win rates of different algorithms against the initial SFT model using the 2.7B proxy model. (b) Count behavior-unsupported actions during RL. We track the average number of actions (tokens) that are not supported by the behavior policy for each response during RL. Over-optimization in traditional methods leads to a sharp rise in unsupported actions, as shown by the dashed line, while our BSPO algorithm keeps these actions consistently low during training. (c) Compare with KL penalty. Compared with the KL penalty method using different penalty coefficients, BSPO effectively avoids reward over-optimization at larger KL divergence distances.
  • Figure 5: The prediction accuracy of the proxy model on pairs of supported and unsupported responses is evaluated across three repeated standard PPO experiments, each conducted with different random seeds.
  • ...and 9 more figures

Theorems & Definitions (23)

  • Definition 1: Behavior-Supported Action
  • Theorem 1: Contraction of $\mathcal{T}^\pi_\beta$
  • Theorem 2: Fixed Points
  • Corollary 1: Supported Policy Optimization
  • Corollary 2
  • Theorem 3: Monotonicity to Optimality
  • Theorem 4: Contraction of $\mathcal{T}^\pi_{\beta,V}$
  • Corollary 3: Equivalent Policy Evaluation
  • Theorem 4: Contraction of $\mathcal{T}^\pi_\beta$
  • proof
  • ...and 13 more