Towards Reliable Alignment: Uncertainty-aware RLHF

Debangshu Banerjee, Aditya Gopalan

TL;DR

This work proves theoretically that an uncertainty-aware, conservative policy-optimization method carries less risk than the vanilla method, and corroborates the theory with experiments in which an ensemble of reward models is used to align a language model with that method.

Abstract

Recent advances in aligning Large Language Models with human preferences have benefited from larger reward models and better preference data. However, most of these methodologies rely on the accuracy of the reward model. The reward models used in Reinforcement Learning with Human Feedback (RLHF) are typically learned from small datasets using stochastic optimization algorithms, making them prone to high variability. We illustrate the inconsistencies between reward models empirically on numerous open-source datasets. We theoretically show that the fluctuation of the reward models can be detrimental to the alignment problem because the derived policies are more overfitted to the reward model and, hence, are riskier if the reward model itself is uncertain. We use concentration of measure to motivate an uncertainty-aware, conservative algorithm for policy optimization. We show that such policies are more risk-averse in the sense that they are more cautious of uncertain rewards. We theoretically prove that our proposed methodology has less risk than the vanilla method. We corroborate our theoretical results with experiments based on designing an ensemble of reward models. We use this ensemble of reward models to align a language model using our methodology and observe that our empirical findings match our theoretical predictions.
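
To make the notion of reward-model variability concrete, here is a minimal Python sketch of how disagreement within an ensemble could be measured for a single prompt-response pair. The function name `ensemble_stats` and the ten scores are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
from statistics import mean, variance


def ensemble_stats(scores):
    """Return (mean, unbiased sample variance) of the scores that an
    ensemble of reward models assigns to one prompt-response pair."""
    return mean(scores), variance(scores)


# Hypothetical scores from 10 independently trained reward models
# (illustrative values only, not data from the paper).
scores = [1.2, -0.4, 2.9, 0.3, 3.5, -1.1, 0.8, 2.2, 1.7, -0.6]
avg, var = ensemble_stats(scores)
print(f"mean reward = {avg:.2f}, sample variance = {var:.2f}")
```

A large sample variance relative to the mean signals that a policy optimized against any single model in the ensemble may be fitting that model's noise rather than the underlying preference.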

Paper Structure

This paper contains 25 sections, 6 theorems, 31 equations, 11 figures, 8 tables.

Key Result

Theorem 2.2

Under the Gaussian assumption, for any $\beta > 0$, the following holds with probability at least $1 - \exp\left(- \frac{\mathrm{XA}}{\beta^2}\right)$:

Figures (11)

  • Figure 1: Reward scores assigned by $10$ reward models on the same prompt-response pair. The reward models are identical in that they are trained independently on the same dataset, with the same hyperparameters and number of epochs. Despite this, we see a wide variation in the score assigned by each model.
  • Figure 2: A $3$-armed bandit problem illustrating true rewards $r_1^*, r_2^*, r_3^*$ (green circles), estimated rewards $\widehat{R}_1, \widehat{R}_2, \widehat{R}_3$ (blue circles), and uncertainty intervals (red brackets). Arm 1 has the lowest true reward but the highest estimate, $\widehat{R}_1$. In contrast, arms 2 and 3 have lower reward estimates $\widehat{R}_2$ and $\widehat{R}_3$, respectively. A naive policy improvement based only on the estimated rewards $\widehat{R}_i$ would increase the probability of Arm $1$, leading to a lower (true) expected return. A more conservative policy improvement strategy should factor in the uncertainty of the estimate for Arm $1$ and assign a lower probability to it, resulting in a higher expected return.
  • Figure 3: In the high-variability setting, variances of reward estimates range between $(3,100)$. Method 2 (variance-aware) exhibits significantly lower return variance than Method 1 (variance-unaware), confirming its risk-averse nature. The standard deviation for Method 2 is $0.04$, while for Method 1 it is $0.13$. The mean returns for both methods are comparable: $4.643$ for Method 1 and $4.644$ for Method 2.
  • Figure 4: In the low-variability setting, variances of reward estimates range between $(70,100)$. Both methods perform similarly, with Method 2 (variance-aware) having a standard deviation of $0.12$ and Method 1 (variance-unaware) having a standard deviation of $0.14$. The mean returns for Method 1 and Method 2 are $0.14$ and $0.13$, respectively.
  • Figure 6: (Top Row) The distribution of sample variances of the reward on the accepted responses, where each sample variance is computed across the $10$ reward models. The median shows that half of the dataset has reward variances greater than $3.81$, with a maximum close to $10$. This corroborates our hypothesis that different reward models will exhibit variability in their reward assignments for the same prompt-response pair. (Bottom Row) The distribution of sample variances of the reward difference between accepted and rejected responses. The figure shows that the reward models are not merely translations of one another, and the variance arises due to the statistical nature of learning these reward models and the stochasticity of the optimization process.
  • ...and 6 more figures
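
The bandit scenario described for Figure 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's algorithm: the reward values, standard deviations, and the helper names `softmax_update` and `expected_return` are hypothetical, and the conservative variant uses a simple lower-confidence-bound penalty (estimate minus one standard deviation) as a stand-in for an uncertainty-aware update.

```python
import math


def softmax_update(ref_policy, scores, beta=1.0):
    """Tilt a reference policy towards high scores:
    new_i is proportional to ref_i * exp(score_i / beta)."""
    weights = [p * math.exp(s / beta) for p, s in zip(ref_policy, scores)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_return(policy, rewards):
    """Expected reward of a policy over the arms."""
    return sum(p * r for p, r in zip(policy, rewards))


# Hypothetical numbers mimicking the figure: arm 1 has the lowest true
# reward but the highest (and most uncertain) estimate.
true_rewards = [0.2, 0.6, 0.7]
estimates = [1.0, 0.5, 0.6]
std_devs = [1.0, 0.1, 0.1]
ref_policy = [1 / 3, 1 / 3, 1 / 3]

# Naive improvement: trust the point estimates.
naive = softmax_update(ref_policy, estimates)

# Conservative improvement (assumed heuristic): penalize each estimate
# by its standard deviation before updating.
penalized = [r - s for r, s in zip(estimates, std_devs)]
conservative = softmax_update(ref_policy, penalized)

print("naive true return:       ", expected_return(naive, true_rewards))
print("conservative true return:", expected_return(conservative, true_rewards))
```

With these illustrative numbers, the naive update raises the probability of arm 1 above the uniform reference, while the penalized update lowers it, so the conservative policy obtains the higher true expected return.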

Theorems & Definitions (14)

  • Theorem 2.2
  • Remark 2.3
  • Definition 3.1: Variance-Unaware Policy, $\pi_1$
  • Definition 3.2: Variance-Aware Policy, $\pi_2$
  • Theorem 3.3
  • Remark 3.4
  • Remark 3.5
  • Theorem 3.6
  • Theorem 6.1
  • proof
  • ...and 4 more