Towards Reward Fairness in RLHF: From a Resource Allocation Perspective

Sheng Ouyang, Yulan Hu, Ge Chen, Qingyang Li, Fuzheng Zhang, Yong Liu

TL;DR

This work identifies reward unfairness in RLHF as a barrier to aligning LLMs with human preferences. It recasts preference learning as a resource-allocation problem, introducing a unified fairness metric and two practical methods—Fairness Regularization and Fairness Coefficient—to balance utility and fairness in reward distributions. The authors apply these ideas to both reward-model verification and RL-based policy tuning, demonstrating fairer reward distributions and improved, or at least preserved, task performance across multiple benchmarks. The approach offers a transferable, bias-agnostic framework for mitigating diverse reward biases without tailoring methods to each bias type, with potential broad impact on RLHF systems.

Abstract

Rewards serve as proxies for human preferences and play a crucial role in Reinforcement Learning from Human Feedback (RLHF). However, if these rewards are inherently imperfect and exhibit various biases, they can adversely affect the alignment of large language models (LLMs). In this paper, we collectively define the various biases present in rewards as the problem of reward unfairness. We propose a bias-agnostic method that addresses reward fairness from a resource allocation perspective: it mitigates these biases effectively without being designed for any specific type of bias. Specifically, we model preference learning as a resource allocation problem, treating rewards as resources to be allocated while considering the trade-off between utility and fairness in their distribution. We propose two methods, Fairness Regularization and Fairness Coefficient, to achieve fairness in rewards. We apply our methods in both verification and reinforcement learning scenarios to obtain a fair reward model and a policy model, respectively. Experiments conducted in these scenarios demonstrate that our approach aligns LLMs with human preferences more fairly.
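
To make the utility/fairness trade-off concrete, the following is a minimal sketch of a regularization-style objective, not the authors' implementation. The abstract does not specify the fairness measure, so this sketch assumes Jain's fairness index (a standard measure from the resource allocation literature) and assumes that each preference pair's "allocation" is the sigmoid of its reward margin. The function names and the weight `lam` are hypothetical. Utility is the usual Bradley-Terry (BT) preference loss.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bt_utility_loss(chosen, rejected):
    """Mean Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    losses = []
    for rc, rr in zip(chosen, rejected):
        margin = rc - rr
        # softplus(-margin) equals -log(sigmoid(margin)), computed stably
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)


def jain_index(allocations):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in [1/n, 1]."""
    n = len(allocations)
    total = sum(allocations)
    sq = sum(a * a for a in allocations)
    if sq == 0.0:
        return 1.0  # all allocations are zero, hence equal
    return (total * total) / (n * sq)


def regularized_loss(chosen, rejected, lam=0.1):
    """Utility loss plus an ASSUMED fairness penalty, 1 - Jain index.

    The per-pair allocation (sigmoid of the reward margin) and the
    weight `lam` are illustrative choices, not taken from the paper.
    """
    shares = [sigmoid(rc - rr) for rc, rr in zip(chosen, rejected)]
    penalty = 1.0 - jain_index(shares)
    return bt_utility_loss(chosen, rejected) + lam * penalty
```

When every pair receives the same margin the penalty vanishes and the objective reduces to the plain BT loss; uneven margins across pairs add a positive penalty scaled by `lam`. The Fairness Coefficient method is not sketched here because the abstract gives no detail on its form.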

Paper Structure

This paper contains 41 sections, 11 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: The reward unfairness problem in RLHF.
  • Figure 2: Objective of Fairness Rewards. $\mathcal{D}_i$ and $\mathcal{D}_j$ represent different data. Fairness rewards aim to obtain rewards that consider the trade-off between utility and fairness. Utility refers to the ability of the rewards to distinguish between preferred and dispreferred responses, as illustrated in the top-right figure. Fairness refers to the consistent distribution of rewards across different data, as depicted in the bottom-right figure.
  • Figure 3: BT rewards on HH-RLHF.
  • Figure 4: FR rewards on HH-RLHF.
  • Figure 5: FC rewards on HH-RLHF.
  • ...and 13 more figures