Group Robust Preference Optimization in Reward-free RLHF

Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, Ilija Bogunovic

TL;DR

The paper tackles biased alignment of LLMs when preferences come from diverse user groups. It introduces GRPO, a robust, reward-free preference optimization framework that maximizes worst-case group performance by adaptively weighting groups during policy fine-tuning, with both GR-DPO and GR-IPO variants. The authors establish theoretical convergence for a log-linear policy and provide a practical alternating-update algorithm (mirror descent) with convergence guarantees, complemented by empirical results on synthetic and real-world global opinion data showing improved worst-group performance and reduced disparity. Overall, GRPO offers a principled approach to equitable LLM alignment across heterogeneous user groups and demonstrates the potential to mitigate bias arising from group imbalances in preference data.

Abstract

Adapting large language models (LLMs) for specific tasks usually involves fine-tuning through reinforcement learning with human feedback (RLHF) on preference data. While these data often come from diverse labelers' groups (e.g., different demographics, ethnicities, company teams, etc.), traditional RLHF approaches adopt a "one-size-fits-all" approach, i.e., they indiscriminately assume and optimize a single preference model, thus not being robust to unique characteristics and needs of the various groups. To address this limitation, we propose a novel Group Robust Preference Optimization (GRPO) method to align LLMs to individual groups' preferences robustly. Our approach builds upon reward-free direct preference optimization methods, but unlike previous approaches, it seeks a robust policy which maximizes the worst-case group performance. To achieve this, GRPO adaptively and sequentially weights the importance of different groups, prioritizing groups with worse cumulative loss. We theoretically study the feasibility of GRPO and analyze its convergence for the log-linear policy class. By fine-tuning LLMs with GRPO using diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines.
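The adaptive weighting described above can be sketched as an alternating update: a mirror-ascent (exponentiated-gradient) step raises the weight of groups with larger loss, then a gradient step updates the policy on the reweighted loss. This is a minimal full-batch sketch, not the paper's algorithm as published. The function names, step sizes, synthetic feature differences, and the omission of the reference-policy term are all assumptions made for illustration.

```python
import numpy as np

def group_loss(theta, diffs, beta=1.0):
    """Mean logistic (DPO-style) loss for one group under a log-linear policy.

    diffs[i] = phi(x_i, y_chosen) - phi(x_i, y_rejected).
    Assumption: the reference-policy term is dropped for brevity.
    """
    margins = beta * (diffs @ theta)
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed stably
    return float(np.mean(np.logaddexp(0.0, -margins)))

def group_grad(theta, diffs, beta=1.0):
    """Gradient of group_loss with respect to theta."""
    margins = beta * (diffs @ theta)
    sig_neg = 0.5 * (1.0 - np.tanh(0.5 * margins))  # sigmoid(-margin)
    return (-beta * sig_neg[:, None] * diffs).mean(axis=0)

def reweight(alpha, losses, step=0.5):
    """One exponentiated-gradient step on the simplex: higher loss, higher weight."""
    log_alpha = np.log(np.clip(alpha, 1e-12, None)) + step * np.asarray(losses)
    log_alpha -= log_alpha.max()  # numerical stability
    new_alpha = np.exp(log_alpha)
    return new_alpha / new_alpha.sum()

def group_robust_fit(groups, steps=300, lr=0.5, step_alpha=0.5, beta=1.0):
    """Alternate between updating group weights and policy parameters."""
    theta = np.zeros(groups[0].shape[1])
    alpha = np.full(len(groups), 1.0 / len(groups))
    for _ in range(steps):
        losses = [group_loss(theta, d, beta) for d in groups]
        alpha = reweight(alpha, losses, step_alpha)
        grad = sum(a * group_grad(theta, d, beta) for a, d in zip(alpha, groups))
        theta = theta - lr * grad
    return theta, alpha

# Hypothetical data: a large group and a small group with different preferences.
rng = np.random.default_rng(0)
majority = np.array([1.0, 0.0]) + 0.1 * rng.standard_normal((200, 2))
minority = np.array([-0.5, 1.0]) + 0.1 * rng.standard_normal((20, 2))

theta, alpha = group_robust_fit([majority, minority])
print("group weights:", alpha)
print("group losses:", [group_loss(theta, d) for d in (majority, minority)])
```

Because the weights live on the probability simplex, the multiplicative update keeps them nonnegative and normalized without an explicit projection, which is the usual reason to prefer mirror ascent for this kind of inner maximization.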

Paper Structure (25 sections, 8 theorems, 81 equations, 12 figures, 2 tables, 3 algorithms)


Key Result

Proposition 1

Under a log-linear parameterization of the policy class, there exists a Nash equilibrium for the group robust direct preference optimization problem (the objective labeled eq:robust-lin-obj in the paper).
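For orientation, the min-max problem that the proposition refers to can be written roughly as follows. The notation is assumed for illustration rather than taken from this page: $K$ groups with datasets $\mathcal{D}_g$, weights $\alpha$ on the probability simplex $\Delta_K$, a feature map $\phi$, and a compact convex parameter set $\Theta$.

```latex
% Sketch of a group robust DPO objective under a log-linear policy.
% All symbols are assumed notation; the paper's exact form may differ.
\min_{\theta \in \Theta} \; \max_{\alpha \in \Delta_K} \;
  \sum_{g=1}^{K} \alpha_g \, \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \mathcal{D}_g),
\qquad
\pi_\theta(y \mid x) \propto \exp\!\big(\theta^\top \phi(x, y)\big).
```

One plausible reading of why an equilibrium exists: the objective is linear in $\alpha$, and under the log-linear form the log-partition term cancels in $\log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x) = \theta^\top\big(\phi(x, y_w) - \phi(x, y_l)\big)$, so the logistic loss is convex in $\theta$. A convex-concave game over compact convex sets admits a saddle point by the minimax theorem. The paper's actual proof may proceed differently.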

Figures (12)

  • Figure 1: Current reward-free preference optimization methods typically optimize based on average human feedback. This often aligns predominantly with the preferences of the majority group (G1, R1 > R2) at the expense of minority groups (G2, R2 > R1). In contrast, our GRPO algorithm introduces adaptive weighting for different user groups and prioritizes optimizing for the worst-case group performance, leading to better alignment for the most disadvantaged groups.
  • Figure 2: The GRPO algorithm (GR-DPO and GR-IPO) leads to a lower worst-case validation loss and reward error compared to importance sampling and vanilla methods. Results refer to the scenario in which groups have different sizes but the same response distribution. Note that the gap between GRPO and importance sampling is smaller than in the figure labeled fig:ipo-swapped-even-imb. This is expected, since the primary difference between groups arises from data imbalance, which importance sampling already handles.
  • Figure 3: The GRPO algorithm (GR-DPO and GR-IPO) leads to a lower worst-case validation loss and reward error compared to the non-robust vanilla methods. Results refer to the scenario in which groups have the same size but different response distributions. Unlike the setups of the figures labeled fig:ipo-swapped-even-imb and fig:ipo-swapped-uneven-imb, importance sampling has no effect here (it coincides with vanilla DPO/IPO since groups have the same size).
  • Figure 4: Synthetic experiments: The GRPO algorithm (GR-DPO and GR-IPO) leads to a significantly lower worst-case validation loss and reward error compared to importance sampling (IS-DPO/IPO) and vanilla methods (DPO, IPO). Results refer to the scenario in which groups differ in both size and response distribution.
  • Figure 5: Global opinion data. Top plots: GR-IPO leads to better worst-case final test loss and reward accuracy compared to IPO. Moreover, it leads to more balanced losses across the different groups, reducing the gap between best- and worst-group loss (Group-1 vs. Group-5). Bottom plots: Log-prob. accuracy (left plot) and group weights (middle plot) during GR-IPO training. GR-IPO increases the weight on worse-performing groups (Groups-2,5) and decreases it on high-performing ones (Groups-1,3,4), leading to better worst-case accuracy. Groups-2,5 are the ones with worse log-prob. accuracy at the beginning of training (right plot, computed on a random subset of the training data). The corresponding end-of-training log-prob. accuracies for GR-IPO are shown in the figure labeled fig:ripo-real-2-app in the experiments appendix (app: expts).
  • ...and 7 more figures
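The captions note that importance sampling coincides with the vanilla methods when groups have the same size. A small sketch of why, assuming importance sampling weights each data point inversely to its group's size (the function names and group sizes here are hypothetical):

```python
import numpy as np

def vanilla_group_weights(group_sizes):
    """Effective weight of each group when averaging the loss over all samples."""
    sizes = np.asarray(group_sizes, dtype=float)
    return sizes / sizes.sum()

def importance_sampling_group_weights(group_sizes):
    """Effective group weights when each sample is weighted by 1 / (its group size)."""
    sizes = np.asarray(group_sizes, dtype=float)
    per_sample = 1.0 / sizes          # weight carried by one sample of each group
    per_group = sizes * per_sample    # total weight per group (all ones)
    return per_group / per_group.sum()

# Imbalanced groups: importance sampling rebalances toward uniform.
print(vanilla_group_weights([900, 100]))
print(importance_sampling_group_weights([900, 100]))

# Equal-size groups: both schemes give the same weights.
print(vanilla_group_weights([500, 500]))
print(importance_sampling_group_weights([500, 500]))
```

Either way the group weights are fixed in advance from the group sizes, whereas GRPO, as described in the abstract, adapts them during training according to each group's loss.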

Theorems & Definitions (13)

  • Proposition 1
  • Proposition 1
  • Proposition 1
  • Lemma 2
  • proof
  • Proposition 2
  • proof
  • Proposition 2
  • proof
  • Lemma 3
  • ...and 3 more