Group Robust Preference Optimization in Reward-free RLHF
Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, Ilija Bogunovic
TL;DR
The paper tackles biased alignment of LLMs when preference data come from diverse user groups. It introduces Group Robust Preference Optimization (GRPO), a robust, reward-free preference optimization framework that maximizes worst-case group performance by adaptively weighting groups during policy fine-tuning, with both GR-DPO and GR-IPO variants. The authors analyze convergence for the log-linear policy class and provide a practical alternating-update algorithm based on mirror descent, complemented by empirical results on synthetic and real-world global opinion data showing improved worst-group performance and reduced disparity across groups. Overall, GRPO offers a principled approach to equitable LLM alignment across heterogeneous user groups and demonstrates the potential to mitigate bias arising from group imbalances in preference data.
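One way to read "maximizes worst-case group performance by adaptively weighting groups" is as a min-max problem over policy parameters and group weights. The notation below is an assumption made for illustration and is not taken from the summary: $G$ is the number of groups, $L_g(\theta)$ is the reward-free (DPO- or IPO-style) loss of policy $\pi_\theta$ on group $g$, and $\Delta_G$ is the probability simplex over groups. The equality holds because a linear function of $\alpha$ attains its maximum over the simplex at a vertex.

```latex
\[
\min_{\theta} \; \max_{g \in \{1,\dots,G\}} L_g(\theta)
\;=\;
\min_{\theta} \; \max_{\alpha \in \Delta_G} \; \sum_{g=1}^{G} \alpha_g \, L_g(\theta),
\qquad
\Delta_G = \Bigl\{ \alpha \in \mathbb{R}^G_{\ge 0} : \sum_{g=1}^{G} \alpha_g = 1 \Bigr\}.
\]
```

Under this reading, the adaptive weights $\alpha$ are the inner maximizer, and alternating updates of $\alpha$ and $\theta$ approximate the saddle point.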
Abstract
Adapting large language models (LLMs) for specific tasks usually involves fine-tuning through reinforcement learning with human feedback (RLHF) on preference data. While these data often come from diverse groups of labelers (e.g., different demographics, ethnicities, or company teams), traditional RLHF methods take a "one-size-fits-all" approach: they indiscriminately assume and optimize a single preference model, and so are not robust to the unique characteristics and needs of the various groups. To address this limitation, we propose a novel Group Robust Preference Optimization (GRPO) method to align LLMs to individual groups' preferences robustly. Our approach builds upon reward-free direct preference optimization methods, but unlike previous approaches, it seeks a robust policy that maximizes the worst-case group performance. To achieve this, GRPO adaptively and sequentially weights the importance of different groups, prioritizing groups with worse cumulative loss. We theoretically study the feasibility of GRPO and analyze its convergence for the log-linear policy class. By fine-tuning LLMs with GRPO using diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines.
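The adaptive weighting described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`dpo_loss`, `update_group_weights`, `robust_loss`), the `beta` and `step_size` parameters, and the toy numbers are all hypothetical. It assumes a DPO-style per-group loss and an exponentiated-gradient step on the simplex (one form of mirror ascent), so that groups with higher loss gain weight; the paper's actual update may differ in scaling and sampling details.

```python
import numpy as np


def dpo_loss(logratio_chosen, logratio_rejected, beta=0.1):
    """Mean DPO-style loss for one group.

    Inputs are log(pi_theta / pi_ref) evaluated on the chosen and the
    rejected responses (hypothetical precomputed values).
    """
    margin = beta * (np.asarray(logratio_chosen, dtype=float)
                     - np.asarray(logratio_rejected, dtype=float))
    # -log(sigmoid(margin)), computed stably as log(1 + exp(-margin))
    return float(np.mean(np.logaddexp(0.0, -margin)))


def update_group_weights(weights, group_losses, step_size=1.0):
    """One exponentiated-gradient (mirror-ascent) step on the simplex.

    Groups with larger loss receive multiplicatively larger weight.
    """
    weights = np.asarray(weights, dtype=float)
    losses = np.asarray(group_losses, dtype=float)
    logits = np.log(weights) + step_size * losses
    logits -= logits.max()  # subtract the max for numerical stability
    new_weights = np.exp(logits)
    return new_weights / new_weights.sum()


def robust_loss(weights, group_losses):
    """Weighted loss that the policy update would then minimize."""
    return float(np.dot(weights, group_losses))


# Toy example with two groups (made-up log-ratios, for illustration only).
losses = [
    dpo_loss([0.8, 0.6], [0.1, 0.0]),   # group 0: policy already prefers chosen
    dpo_loss([0.0, -0.2], [0.3, 0.4]),  # group 1: policy prefers rejected
]
weights = update_group_weights([0.5, 0.5], losses)
objective = robust_loss(weights, losses)
```

In a full alternating scheme, the policy parameters would next take a gradient step on `objective`, after which the per-group losses are recomputed and the weights updated again.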
