Reward-Robust RLHF in LLMs

Yuzi Yan, Xingzhou Lou, Jialian Li, Yiping Zhang, Jian Xie, Chao Yu, Yu Wang, Dong Yan, Yuan Shen

TL;DR

This paper introduces a reward-robust RLHF framework that addresses fundamental challenges of Reinforcement Learning from Human Feedback through a novel optimization objective, which balances performance and robustness by using Bayesian Reward Model Ensembles (BRME) to model the uncertainty set of reward functions.

Abstract

As Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is increasingly seen as a key pathway toward achieving Artificial General Intelligence (AGI). However, the reliance on reward-model-based (RM-based) alignment methods introduces significant challenges due to the inherent instability and imperfections of Reward Models (RMs), which can lead to critical issues such as reward hacking and misalignment with human intentions. In this paper, we introduce a reward-robust RLHF framework aimed at addressing these fundamental challenges, paving the way for more reliable and resilient learning in LLMs. Our approach introduces a novel optimization objective that carefully balances performance and robustness by incorporating Bayesian Reward Model Ensembles (BRME) to model the uncertainty set of reward functions. This allows the framework to integrate both nominal performance and minimum reward signals, ensuring more stable learning even with imperfect RMs. Empirical results demonstrate that our framework consistently outperforms baselines across diverse benchmarks, showing improved accuracy and long-term stability. We also provide a theoretical analysis, demonstrating that reward-robust RLHF approaches the stability of constant reward settings, which proves to be acceptable even in a stochastic-case analysis. Together, these contributions highlight the framework's potential to enhance both the performance and stability of LLM alignment.

Paper Structure (37 sections, 1 theorem, 14 equations, 10 figures, 8 tables)

This paper contains 37 sections, 1 theorem, 14 equations, 10 figures, and 8 tables.

Key Result

Lemma 1

If the reward model provides a constant reward for all actions during PPO training, the actor will not be optimized, as the gradient of the PPO objective function with respect to the policy parameters will be zero.
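
The statement of Lemma 1 can be checked numerically on a toy problem. The sketch below is not the paper's proof or setup: it assumes a single-state softmax policy, a clipped PPO surrogate, and an advantage computed as the reward minus a mean baseline (so a constant reward yields zero advantage). The names `ppo_surrogate` and `numerical_grad` are hypothetical, and the KL penalty and value network used in full RLHF pipelines are omitted.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def ppo_surrogate(theta, theta_old, actions, rewards, eps=0.2):
    """Clipped PPO surrogate for a single-state softmax policy.

    Assumption: the advantage is reward minus the batch-mean baseline.
    """
    pi = softmax(theta)
    pi_old = softmax(theta_old)
    ratio = pi[actions] / pi_old[actions]
    adv = rewards - rewards.mean()
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * adv, clipped * adv).mean()


def numerical_grad(f, theta, h=1e-6):
    """Central finite-difference gradient of f at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * h)
    return grad


theta0 = np.zeros(3)                 # uniform policy over 3 actions
acts = np.arange(60) % 3             # deterministic batch of actions
const_r = np.full(60, 1.0)           # constant reward for every action
varied_r = acts.astype(float)        # reward that depends on the action

g_const = numerical_grad(lambda t: ppo_surrogate(t, theta0, acts, const_r), theta0)
g_varied = numerical_grad(lambda t: ppo_surrogate(t, theta0, acts, varied_r), theta0)
print("constant reward gradient:", g_const)
print("varied reward gradient:  ", g_varied)
```

Under these assumptions the constant-reward gradient vanishes because every advantage is zero, while the action-dependent reward gives a nonzero update direction.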

Figures (10)

  • Figure 1: Diagram and synthetic experiment results with the toy model. In the standard RLHF pipeline (upper gray frame), even with a dataset annotated by a global annotator, the obtained RM and the actor trained by PPO remain imperfect. In contrast, in our reward-robust RLHF pipeline (lower orange frame), by integrating the nominal reward functions and the uncertainty set, we can obtain the optimal actor within PPO.
  • Figure 2: Diagram for BRME. Each head outputs the mean and the standard deviation of the corresponding reward distribution, and reparametrization is employed to address the non-differentiability.
  • Figure 3: Evaluation results of the reward-robust RLHF framework as the performance-robustness trade-off hyperparameter $\lambda$ varies, where the objective function is $J_{\lambda}(\theta) := \lambda J_{\text{perform}}(\theta) + (1-\lambda) J_{\text{robust}}(\theta)$. Note that when $\lambda=1$, the algorithm reduces to standard RLHF with a single nominal reward model.
  • Figure 4: Performance shifts in the sections on over-scoring vs. under-scoring and on worst-case analysis in robust RLHF suggest that: 1) under-scoring is generally preferable to over-scoring, and 2) leveraging the minimum reward in the uncertainty set helps mitigate performance decline when the RM underperforms.
  • Figure 5: The left figure shows that as training progresses, the performance under the minimum reward setting steadily improves, indicating that conservative optimization benefits the long-term PPO optimization of LLMs. The right figure illustrates the effect of minimizing the reward on its value distribution, reducing the range and making under-scoring dominant.
  • ...and 5 more figures
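
The reparametrized heads described in the Figure 2 caption and the trade-off objective in the Figure 3 caption can be illustrated together. This is a minimal sketch under stated assumptions, not the authors' implementation: the mixture is applied to per-sample rewards, the robust term is taken as the per-sample minimum over ensemble heads, and the nominal reward is passed in by the caller because the text here does not say how it is chosen. The function names `sample_head_rewards` and `mixed_reward` are hypothetical.

```python
import numpy as np


def sample_head_rewards(mean, std, rng):
    """Reparametrized sample r = mean + std * eps, with eps ~ N(0, 1).

    mean, std: arrays of shape (num_heads, batch). Writing the sample this
    way keeps it a deterministic function of mean and std given eps, which
    is what lets an autodiff framework pass gradients to both.
    """
    eps = rng.standard_normal(mean.shape)
    return mean + std * eps


def mixed_reward(nominal, head_rewards, lam):
    """Per-sample lam * nominal + (1 - lam) * min over heads.

    Assumption: the robust signal is the minimum across ensemble heads.
    With lam = 1 only the nominal reward remains.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    worst_case = head_rewards.min(axis=0)
    return lam * nominal + (1.0 - lam) * worst_case


rng = np.random.default_rng(0)
head_mean = np.array([[1.0, 2.0], [0.5, 3.0], [2.0, 1.0]])  # 3 heads, 2 samples
head_std = np.full_like(head_mean, 0.1)
head_r = sample_head_rewards(head_mean, head_std, rng)
nominal_r = head_r[0]  # illustrative choice of nominal head, not from the source
print(mixed_reward(nominal_r, head_r, lam=0.5))
```

Setting `lam=1.0` returns the nominal reward unchanged, matching the caption's note that the method then reduces to standard RLHF with a single nominal reward model.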

Theorems & Definitions (2)

  • Lemma 1
  • Remark 1