Information-Theoretic Reward Decomposition for Generalizable RLHF

Liyuan Mao, Haoran Xu, Amy Zhang, Weinan Zhang, Chenjia Bai

TL;DR

To address generalization gaps in RLHF reward models, the paper proposes an information-theoretic decomposition of the reward into a prompt-free component $r_2$ and a prompt-related component $r_1$, derived without extra models via a mutual-information objective. It shows that feasible $r_1^*$ and $r_2^*$ exist with $\Delta r_\theta = \Delta r_1^* + \Delta r_2^*$, and it provides a practical way to estimate $\Delta r_2^*(y_1,y_2)$ using binary search and importance sampling over $P(X \mid Y_1,Y_2)$. Reward learning is then guided by prioritizing data with small prompt-free gaps $\Delta r_2$, which encourages the model to focus on prompt-related information and reduces prompt-free prejudice. Empirical results on toy scenarios and standard RLHF benchmarks show improved reward-model alignment and better generalization of the induced policy across base models such as LLaMA-3-8B-Instruct and Mistral-7B-Instruct.
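
In symbols, the decomposition summarized above can be written out as follows. This is only a notational sketch: the explicit arguments $(x, y_1, y_2)$ and the convention that each gap is the reward of $y_1$ minus the reward of $y_2$ are assumptions that this summary does not spell out.

$$
\Delta r_\theta(x, y_1, y_2) = r_\theta(x, y_1) - r_\theta(x, y_2) = \Delta r_1^*(x, y_1, y_2) + \Delta r_2^*(y_1, y_2)
$$

Under this reading, a component that depends only on the response, $r_2(x, y) = r_2(y)$, contributes a gap $r_2(y_1) - r_2(y_2)$ that stays the same whichever prompt $x$ the two responses are paired with.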

Abstract

A generalizable reward model is crucial in Reinforcement Learning from Human Feedback (RLHF) as it enables correctly evaluating unseen prompt-response pairs. However, existing reward models lack this ability, as they are typically trained by increasing the reward gap between chosen and rejected responses, while overlooking the prompts that the responses are conditioned on. Consequently, when the trained reward model is evaluated on prompt-response pairs that lie outside the data distribution, neglecting the effect of prompts may result in poor generalization of the reward model. To address this issue, we decompose the reward value into two independent components: prompt-free reward and prompt-related reward. Prompt-free reward represents the evaluation that is determined only by responses, while the prompt-related reward reflects the reward that derives from both the prompt and the response. We extract these two components from an information-theoretic perspective, which requires no extra models. Subsequently, we propose a new reward learning algorithm by prioritizing data samples based on their prompt-free reward values. Through toy examples, we demonstrate that the extracted prompt-free and prompt-related rewards effectively characterize two parts of the reward model. Further, standard evaluations show that our method improves both the alignment performance and the generalization capability of the reward model.
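
The prioritization idea in the abstract can be illustrated with a minimal sketch. It assumes that the prompt-free gaps $\Delta r_2$ have already been estimated for each preference pair, and it uses a simple "keep the smallest fraction" rule; both the selection rule and the `keep_fraction` parameter are hypothetical choices for illustration, not the paper's algorithm.

```python
import numpy as np


def bt_loss(reward_gap):
    """Bradley-Terry negative log-likelihood, -log(sigmoid(gap))."""
    # -log(sigmoid(g)) = log(1 + exp(-g)) = logaddexp(0, -g), which is numerically stable.
    return np.logaddexp(0.0, -np.asarray(reward_gap, dtype=float))


def prioritized_bt_loss(reward_gap, prompt_free_gap, keep_fraction=0.5):
    """Average BT loss over the pairs with the smallest prompt-free gap.

    keep_fraction is a hypothetical knob, not a value taken from the paper.
    """
    reward_gap = np.asarray(reward_gap, dtype=float)
    prompt_free_gap = np.asarray(prompt_free_gap, dtype=float)
    n_keep = max(1, int(round(keep_fraction * len(reward_gap))))
    # Indices of the n_keep pairs whose (signed) Delta r_2 is smallest.
    keep_idx = np.argsort(prompt_free_gap, kind="stable")[:n_keep]
    return bt_loss(reward_gap[keep_idx]).mean(), keep_idx


# Toy values: full reward gaps and assumed prompt-free gaps for four pairs.
gaps = np.array([0.2, 1.5, -0.3, 0.8])
pf_gaps = np.array([0.1, 1.2, -0.5, 0.9])
loss, kept = prioritized_bt_loss(gaps, pf_gaps, keep_fraction=0.5)
```

In this toy call, the two pairs with the smallest assumed $\Delta r_2$ are the only ones that contribute to the loss; the pairs whose preference is already explained by the response alone are left out of the update.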

Paper Structure

This paper contains 31 sections, 3 theorems, 52 equations, 6 figures, 12 tables, 2 algorithms.

Key Result

Theorem 1

When the value of $r_2$ depends only on the response, i.e., $r_2(x, y) = r_2(y)$, we have $\mathrm{MI}(\tilde{Z}\|\tilde{W}) = \mathrm{MI}(\tilde{Z}\|W)$.

Figures (6)

  • Figure 1: Left: reward gaps calculated with corresponding prompt and randomly sampled prompts using QRM-Llama3-8B on two different datasets that were used for training. When calculating with other prompts, the curves show the mean and the std. Right: illustrative failure case where the reward gap overly depends on the responses. Solid lines represent corresponding prompt-response pairs (used in training), while dashed lines represent non-corresponding pairs (unseen during training). Since the reward gap overly depends on the responses, it generalizes poorly to novel prompt-response pairs constructed even with seen prompts.
  • Figure 2: (a) Information in $W$ and $\tilde{W}$. (b) Desired information in $Z$ and $\tilde{Z}$. (c) Undesired information in $Z$ and $\tilde{Z}$.
  • Figure 3: (Illustrative) We characterize training data samples in a 2-dimensional quadrant diagram, with the decomposed reward gaps $\Delta r_1$ and $\Delta r_2$. (a) shows initial data samples before training. (b) After training, the ideal distribution should be centered on the positive half of the $\Delta r_1$-axis, indicating that the preference depends solely on prompt-related information. (c-d) However, in the training process of $\Delta r_\theta$, the update following the BT model can only ensure the data points move in at least one of the positive directions of the $\Delta r_1$-axis or $\Delta r_2$-axis (up or right). If the update is based on samples with small $\Delta r_2$ (e.g. the left half), their movement upwards or to the right, along with other unupdated samples (e.g. the right half), will cause the distribution to be more centered on the positive $\Delta r_1$-axis. On the other hand, if the update is based on all samples, as shown in (d), the movement of samples with large $\Delta r_2$ values to the right exacerbates the existing prejudices, causing the distribution to become more centered on the positive $\Delta r_2$-axis.
  • Figure 4: (a): Visualizations for the length-biased dataset. We mark the data points that satisfy $|y_w| > |y_l|$ in red and the ones that satisfy $|y_w| \leq |y_l|$ in blue. (b): Visualizations for the adversarial prompt dataset. We mark adversarial data in red and original data in blue. In both (a) and (b), the top shows ordinary training data, and the bottom shows prioritized training data.
  • ...and 2 more figures
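
The diagnostic described in the Figure 1 caption, comparing the reward gap under the corresponding prompt with the gap under randomly sampled prompts, can be sketched as follows. The function `gap_under_prompts`, its parameters, and the toy `length_reward` scorer are hypothetical stand-ins; a real check would call an actual reward model such as the one named in the caption.

```python
import random
import statistics


def gap_under_prompts(reward_fn, prompt, chosen, rejected,
                      other_prompts, n_samples=8, seed=0):
    """Return (own_gap, mean_other_gap, std_other_gap) for one preference pair."""
    rng = random.Random(seed)
    # Gap computed with the prompt the responses were actually written for.
    own_gap = reward_fn(prompt, chosen) - reward_fn(prompt, rejected)
    # Gaps computed after swapping in randomly sampled, non-corresponding prompts.
    sampled = [rng.choice(other_prompts) for _ in range(n_samples)]
    other_gaps = [reward_fn(p, chosen) - reward_fn(p, rejected) for p in sampled]
    return own_gap, statistics.mean(other_gaps), statistics.pstdev(other_gaps)


def length_reward(prompt, response):
    """Toy reward that ignores the prompt and scores by word count (illustration only)."""
    return float(len(response.split()))


own, mean_other, std_other = gap_under_prompts(
    length_reward,
    prompt="What is 2 + 2?",
    chosen="The answer is 4 because 2 + 2 = 4.",
    rejected="4",
    other_prompts=["Name a color.", "Translate 'cat' to French."],
)
```

Because the toy reward never looks at the prompt, the gap is identical under every prompt and the spread across random prompts is zero. A reward model whose gap barely changes when the prompt is swapped is showing the response-driven behavior that the caption describes as a failure case.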

Theorems & Definitions (6)

  • Theorem 1
  • Theorem 2
  • proof
  • proof
  • Lemma 1
  • proof