Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning

Mathieu Rita, Florian Strub, Rahma Chaabouni, Paul Michel, Emmanuel Dupoux, Olivier Pietquin

TL;DR

The paper tackles reward over-optimization (ROO) in reinforcement learning fine-tuning of LLMs by shifting the objective from maximizing a reward to calibrating it against human demonstrations. The proposed Reward Calibration from Demonstration (RCfD) uses demonstrations and a reward model to align the LM’s rewards with demonstrated scores, reducing reward-model gaming and promoting natural, diverse outputs. Through three use cases—sequence-level log-likelihood calibration, single-reward ROO mitigation, and multi-reward calibration—RCfD achieves performance comparable to tuned baselines while offering improved stability and predictability. The approach hinges on demonstration data to guide the reward distribution, with limitations including data requirements and potential biases, but it shows strong promise for robust, multi-reward RL in complex language tasks.

Abstract

While Reinforcement Learning (RL) has proven essential for tuning large language models (LLMs), it can lead to reward over-optimization (ROO). Existing approaches address ROO by adding KL regularization, which requires computationally expensive hyperparameter tuning. Additionally, KL regularization focuses solely on regularizing the language policy, neglecting a potential source of regularization: the reward function itself. Inspired by demonstration-guided RL, we introduce Reward Calibration from Demonstration (RCfD), which leverages human demonstrations and a reward model to recalibrate the reward objective. Formally, given a prompt, the RCfD objective minimizes the distance between the demonstrations' and LLM's rewards rather than directly maximizing the reward function. This objective shift avoids incentivizing the LLM to exploit the reward model and promotes more natural and diverse language generation. We show the effectiveness of RCfD on three language tasks, where it achieves performance comparable to carefully tuned baselines while mitigating ROO.

Paper Structure (30 sections, 3 equations, 5 figures, 6 tables)

This paper contains 30 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The RCfD objective is the L2-distance between the reward from the LM and the reward from the demonstration. Given a prompt $x$, a demonstration $y_{d}$, and the LLM continuation $y$, the RM computes the demonstration reward $R_{RM}(x,y_{d})$ and the LM reward $R_{RM}(x,y)$. Instead of maximizing $R(x,y)$ as in standard RL, we here aim at maximizing the RCfD objective defined as $R_{RCfD}(x,y) = -||R_{RM}(x,y) - R_{RM}(x,y_{d})||^2_2$.
  • Figure 2: Average log-likelihood as a function of the generation length. Optimizing $R_{\beta=0}$ finds LLM exploits to minimize the likelihood, while imitation-based models suffer from exposure bias. Only $R_{RCfD}$ has an average log-likelihood that matches human behavior.
  • Figure 3: Results of the movie review task. (Left) Comparison between the reward distribution of human demonstrations and LLM generations for the different methods. Vertical lines mark the mean of each distribution. (Right) Normalized evaluation score of each LLM. RCfD outperforms the base model and SFT by matching the reward distribution of the demonstrations. Absolute scores are provided in the paper's movie review appendix. If carefully tuned, optimizing $R_{\beta}$ can match the reward distribution, but subtle changes in $\beta$ also induce drastic behavior changes. When $\beta=0$, the LM achieves near-optimal rewards, yet the policy is degraded (naturalness close to 0), illustrating an instance of ROO.
  • Figure 4: The Pareto front emerges when optimizing $R_{RM}$ and $-R_{length}$ for the summarization task. This front is delineated by varying the balancing weight $\alpha$ in $R_{\alpha}$ and using PPO. Notably, the average coordinate of the demonstration rewards is located on this front. RCfD facilitates the direct targeting of this coordinate.
  • Figure 5: (a) Average log-likelihood as a function of the generation length. (b) Distribution of the average log-likelihood of human sentences over the different baselines (generations of $700$ tokens).
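
The RCfD objective stated in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `rcfd_reward` and its arguments are hypothetical, and it assumes the reward model scores $R_{RM}(x,y)$ and $R_{RM}(x,y_{d})$ have already been computed and are passed in as plain lists of floats (one entry per reward, so a single-reward task uses one-element lists).

```python
from typing import Sequence


def rcfd_reward(lm_rewards: Sequence[float], demo_rewards: Sequence[float]) -> float:
    """Negative squared L2 distance between LM and demonstration rewards.

    lm_rewards:   reward-model scores for the LM continuation y (hypothetical input).
    demo_rewards: reward-model scores for the demonstration y_d (hypothetical input).
    """
    if len(lm_rewards) != len(demo_rewards):
        raise ValueError("LM and demonstration rewards must have the same length")
    # -|| R_RM(x, y) - R_RM(x, y_d) ||_2^2, summed over reward components.
    return -sum((r - d) ** 2 for r, d in zip(lm_rewards, demo_rewards))


if __name__ == "__main__":
    # Matching the demonstration's reward gives the best possible value, 0.
    print(rcfd_reward([0.5], [0.5]))
    # Overshooting the demonstration's reward is penalized, not rewarded.
    print(rcfd_reward([3.0], [1.0]))
    # Two rewards (e.g. a preference score and a length term) are calibrated jointly.
    print(rcfd_reward([1.0, 2.0], [0.0, 0.0]))
```

Under this form the objective is at most zero, and it reaches zero only when the LM's rewards equal the demonstration's. A continuation that scores far above the demonstration is therefore penalized as much as one that scores far below it, which is the property the captions rely on when they describe RCfD as targeting the demonstrations' reward rather than maximizing the reward model.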