Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models

Kefan Song; Jin Yao; Runnan Jiang; Rohan Chandra; Shangtong Zhang

TL;DR

The paper tackles group fairness in large language models by shifting focus from final outputs to the reward-model stage of reinforcement learning from human feedback (RLHF). It introduces a benchmark based on arXiv metadata that evaluates group fairness across eight disciplinary groups, using ANOVA and Tukey HSD tests together with a Normalized Maximum Group Difference metric. Empirical results show statistically significant group unfairness in all eight reward models evaluated. Top-performing reward models tend to be fairer, but they still exhibit disparities. The work highlights the need to address fairness at the reward-model level so that LLMs benefit diverse user groups equitably.

Abstract

As Large Language Models (LLMs) become increasingly powerful and accessible to human users, ensuring fairness across diverse demographic groups, i.e., group fairness, is a critical ethical concern. However, current fairness and bias research in LLMs is limited in two aspects. First, compared to traditional group fairness in machine learning classification, it requires that the non-sensitive attributes, in this case the prompt questions, be the same across different groups. In many practical scenarios, however, different groups may prefer different prompt questions, and this requirement becomes impractical. Second, it evaluates group fairness only for the LLM's final output without identifying the source of possible bias. Namely, bias in an LLM's output can result from both pretraining and finetuning. For finetuning, the bias can result from both the RLHF procedure and the learned reward model. Arguably, evaluating the group fairness of each component in the LLM pipeline could help develop better methods to mitigate the possible bias. Recognizing these two limitations, this work benchmarks the group fairness of learned reward models. By using expert-written text from arXiv, we are able to benchmark the group fairness of reward models without requiring the same prompt questions across different demographic groups. Surprisingly, our results demonstrate that all the evaluated reward models (e.g., Nemotron-4-340B-Reward, ArmoRM-Llama3-8B-v0.1, and GRM-llama3-8B-sftreg) exhibit statistically significant group unfairness. We also observe that top-performing reward models (w.r.t. canonical performance metrics) tend to demonstrate better group fairness.
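To make the group comparison concrete, the sketch below computes a one-way ANOVA F statistic over per-group reward scores and a normalized maximum group difference. It is an illustration only, not the authors' implementation. The group labels and scores are made-up toy values, the function names are hypothetical, and the normalization used here (largest gap between group means divided by the range of all observed scores) is an assumption, since the exact definition of the paper's metric is not given in this summary.

```python
from statistics import mean


def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a dict mapping group name -> list of scores."""
    all_scores = [s for scores in groups.values() for s in scores]
    grand_mean = mean(all_scores)
    k = len(groups)       # number of groups
    n = len(all_scores)   # total number of scored texts
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    # Variation of individual scores around their own group mean.
    ss_within = sum((s - mean(v)) ** 2 for v in groups.values() for s in v)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def normalized_max_group_difference(groups):
    """ASSUMED definition: max gap between group means over the full score range."""
    group_means = [mean(v) for v in groups.values()]
    all_scores = [s for scores in groups.values() for s in scores]
    score_range = max(all_scores) - min(all_scores)
    return (max(group_means) - min(group_means)) / score_range


# Toy reward scores per disciplinary group (fabricated for illustration only).
toy_scores = {
    "cs": [1.0, 2.0, 3.0],
    "math": [2.0, 3.0, 4.0],
    "physics": [5.0, 6.0, 7.0],
}

f_stat = one_way_anova_f(toy_scores)
nmgd = normalized_max_group_difference(toy_scores)
print(f"F statistic: {f_stat:.3f}")
print(f"Normalized max group difference: {nmgd:.3f}")
```

A large F statistic indicates that mean reward differs between groups by more than within-group noise would explain. For p-values and pairwise comparisons, SciPy's `scipy.stats.f_oneway` and `scipy.stats.tukey_hsd` could be used in place of the hand-rolled statistic.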
