Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization
Juntao Dai, Taiye Chen, Yaodong Yang, Qian Zheng, Gang Pan
TL;DR
Reward over-optimization in RLHF arises largely from reward-model extrapolation error on out-of-distribution (OOD) responses generated during RL. The paper introduces Behavior-Supported Policy Optimization (BSPO), which models the reward model's in-distribution (ID) region with a behavior policy defined as the next-token distribution of the reward training dataset, and applies a behavior-supported Bellman operator that penalizes OOD values while leaving ID values unchanged, with theoretical contraction and monotonic improvement guarantees. Empirically, BSPO outperforms baselines at reducing OOD generation and at finding the optimal ID policy across model scales and datasets, even under label noise. The approach offers a principled, data-efficient route to stable RLHF that aligns policies with true human objectives while mitigating extrapolation-induced over-optimization.
Abstract
Reinforcement learning from human feedback (RLHF) is an effective method for aligning large language models (LLMs) with human values. However, reward over-optimization remains an open challenge, leading to discrepancies between the performance of LLMs under the reward model and the true human objectives. A primary contributor to reward over-optimization is the extrapolation error that arises when the reward model evaluates out-of-distribution (OOD) responses. Current methods still fail to prevent the increasing frequency of OOD response generation during the reinforcement learning (RL) process and do not handle extrapolation errors from OOD responses effectively. In this work, we propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue. Specifically, we define the behavior policy as the next-token distribution of the reward training dataset to model the in-distribution (ID) region of the reward model. Building on this, we introduce the behavior-supported Bellman operator to regularize the value function, penalizing all OOD values without impacting the ID ones. Consequently, BSPO reduces the generation of OOD responses during the RL process, thereby avoiding overestimation caused by the reward model's extrapolation errors. Theoretically, we prove that BSPO guarantees a monotonic improvement of the supported policy until convergence to the optimal behavior-supported policy. Empirical results from extensive experiments show that BSPO outperforms baselines in preventing reward over-optimization due to OOD evaluation and in finding the optimal ID policy.
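The regularization idea in the abstract can be illustrated with a toy tabular sketch. This is not the paper's formulation: the three-state environment, the support threshold `eps`, the penalty value `q_min`, and the function names are all hypothetical choices made for illustration. The sketch assumes an action counts as ID when its behavior-policy probability is at least `eps`, and that OOD state-action pairs are assigned a fixed low value instead of their (possibly extrapolated) reward-model return.

```python
import numpy as np

def supported_backup(q, reward, next_state, terminal, behavior, eps, q_min, gamma=1.0):
    """One application of a toy behavior-supported Bellman backup (illustrative only)."""
    supported = behavior >= eps                       # ID (state, action) pairs
    # State value: best Q over supported actions only; terminal states have value 0.
    v = np.where(supported, q, -np.inf).max(axis=1)
    v = np.where(terminal, 0.0, v)
    target = reward + gamma * v[next_state]           # ordinary Bellman target
    # ID entries keep the ordinary target; OOD entries are pinned to q_min.
    return np.where(supported, target, q_min)

def solve(reward, next_state, terminal, behavior, eps, q_min, n_iters=10):
    """Iterate the backup from a zero initialization."""
    q = np.zeros_like(reward)
    for _ in range(n_iters):
        q = supported_backup(q, reward, next_state, terminal, behavior, eps, q_min)
    return q

# Hypothetical environment: states 0 and 1 are non-terminal, state 2 is terminal.
# Action 1 in state 0 has a large (extrapolated) reward but almost no behavior support.
reward = np.array([[0.0, 5.0], [1.0, 2.0], [0.0, 0.0]])
next_state = np.array([[1, 2], [2, 2], [2, 2]])
terminal = np.array([False, False, True])
behavior = np.array([[0.99, 0.01], [0.6, 0.4], [0.5, 0.5]])

q_star = solve(reward, next_state, terminal, behavior, eps=0.05, q_min=-10.0)
greedy = q_star.argmax(axis=1)  # greedy action per state under the regularized values
```

In this toy setting an unregularized backup would prefer the unsupported action in state 0 because of its inflated reward, whereas the regularized values steer the greedy policy toward the supported action and leave the values of supported pairs unchanged.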
