MaxMin-RLHF: Alignment with Diverse Human Preferences

Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang

TL;DR

The paper formalizes why a single reward function in RLHF cannot align a language model with diverse human preferences. It introduces MaxMin-RLHF, which learns a mixture of reward functions via an expectation-maximization (EM) algorithm to represent multiple subpopulations and then optimizes an egalitarian max-min objective over them. Theoretical results show an inherent alignment gap for minority groups under single-reward RLHF, while empirical studies on GPT-2 and Tulu2-7B show improved win-rates for minority groups without loss of majority performance. The approach connects to distributionally robust optimization and social choice theory, and the authors note that it extends beyond language models to reinforcement learning in general.

Abstract

Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. However, such an approach overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result of alignment with single reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. To provide an equitable solution to the problem, we learn a mixture of preference distributions via an expectation-maximization algorithm and propose a MaxMin alignment objective for policy learning inspired by the Egalitarian principle in social choice theory to better represent diverse human preferences. We elucidate the connection of our proposed approach to distributionally robust optimization and general utility RL, thereby highlighting the generality and robustness of our proposed solution. We present comprehensive experimental results on small-scale (GPT-2) and large-scale language models (with Tulu2-7B) and show the efficacy of the proposed approach in the presence of diversity among human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms and improves the win-rate (accuracy) for minority groups by over 33% without compromising the performance of majority groups, showcasing the robustness and fairness of our approach. We remark that our findings in this work are not only limited to language models but also extend to reinforcement learning in general.
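
To make the contrast between an averaged objective and the egalitarian MaxMin objective concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a small, hypothetical set of candidate policies whose expected utility for each user group is already known; the policy names, group weights, and utility values are illustrative and are not taken from the paper, and the KL-regularized policy optimization used in practice is omitted.

```python
from typing import Dict

# Hypothetical expected utility of each candidate policy for each user group.
# These numbers are illustrative only and do not come from the paper.
utilities: Dict[str, Dict[str, float]] = {
    "positive_long": {"majority": 1.0, "minority": 0.0},
    "balanced": {"majority": 0.7, "minority": 0.7},
    "neutral_concise": {"majority": 0.1, "minority": 1.0},
}

# Assumed population shares of the two groups.
weights: Dict[str, float] = {"majority": 0.8, "minority": 0.2}


def average_choice(utilities: Dict[str, Dict[str, float]],
                   weights: Dict[str, float]) -> str:
    """Pick the policy with the highest population-weighted utility."""
    def weighted(policy: str) -> float:
        return sum(weights[g] * u for g, u in utilities[policy].items())
    return max(utilities, key=weighted)


def maxmin_choice(utilities: Dict[str, Dict[str, float]]) -> str:
    """Pick the policy whose worst-off group is best off (egalitarian rule)."""
    return max(utilities, key=lambda p: min(utilities[p].values()))


print("weighted average picks:", average_choice(utilities, weights))
print("max-min picks:", maxmin_choice(utilities))
```

In this toy setting the weighted average favors the policy that serves only the majority, while the max-min rule selects the policy that keeps both groups reasonably well served, which is the behavior the abstract attributes to the Egalitarian principle.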

Paper Structure (19 sections, 51 equations, 7 figures, 4 tables, 2 algorithms)

Figures (7)

  • Figure 1: This figure highlights the drawbacks of the current state-of-the-art alignment framework, Reinforcement Learning from Human Feedback (RLHF) (Christian, 2020), which relies on a single reward. We illustrate a setting in which the users who provide feedback form majority and minority groups; single-reward RLHF then aligns the language model towards the majority group while completely ignoring the preferences of the minority user group. We provide a theoretical justification in the section on the impossibility result and empirical evidence in the experiments section.
  • Figure 2: (Diversity in Preferences). This figure illustrates the diversity in preferences between two distinct human groups using the IMDB movie review dataset (Maas et al., 2011). We categorize these groups as 'majority' and 'minority'. Panels (a) and (c) display the minority group's sentiment and conciseness preferences. The minority group strongly favors concise responses (as seen in the blue curve in (c)) while showing indifference towards sentiment (as indicated by the overlapping curves in (a)). In contrast, panels (b) and (d) show that the majority clearly prioritizes positive sentiment (as evidenced by the significant gap between chosen and rejected trajectories in (b)) while displaying little concern for conciseness (as indicated by the overlapping curves in (d)).
  • Figure 3: (Empirical Evidence of Impossibility). This figure validates our theoretical impossibility result and provides empirical evidence that single-reward RLHF cannot achieve alignment on the preference dataset presented in Figure 2. Here, the task is to align the LLM to generate concise responses with positive sentiment. The aligned language model generates highly positive sentiment sentences but completely ignores the requirement of conciseness. This happens because the humans who prefer conciseness are in the minority compared to the humans who prefer positive sentiment, as described in Figure 2.
  • Figure 4: (Alignment with MaxMin-RLHF). This figure shows the performance of our proposed MaxMin-RLHF algorithm on the preference dataset described in Figure 2. The task is to align a language model to generate positive sentiment responses that are concise (of shorter token length). The language model aligned with MaxMin-RLHF generates highly positive sentiment sentences and also satisfies the conciseness criterion, which shows alignment with both the majority and the minority preferences.
  • Figure 5: This figure shows the average performance in terms of the sentiment of the generated output and its conciseness. As expected, MaxMin-RLHF satisfies both alignment criteria better than single-reward RLHF.
  • ...and 2 more figures
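
The majority-dominance effect described in the captions of Figures 1 and 3 can be illustrated with a toy calculation. This is a hedged sketch, not the paper's proof or experiment: it assumes the Bradley-Terry preference model commonly used for RLHF reward learning, a single pair of responses A (positive but long) and B (neutral but concise), and hypothetical group shares and preference rates. For a single pair, the maximum-likelihood reward gap under Bradley-Terry is the logit of the observed preference rate.

```python
import math


def pooled_preference(majority_share: float,
                      p_majority: float,
                      p_minority: float) -> float:
    """Probability that a randomly drawn annotator prefers A over B."""
    return majority_share * p_majority + (1.0 - majority_share) * p_minority


def bt_reward_gap(p_a_over_b: float) -> float:
    """Bradley-Terry MLE of r(A) - r(B) for one pair: the logit of the preference rate."""
    return math.log(p_a_over_b / (1.0 - p_a_over_b))


# Hypothetical numbers, not taken from the paper.
majority_share = 0.8   # fraction of annotators in the majority group
p_majority = 0.9       # majority prefers the positive, long response A
p_minority = 0.1       # minority prefers the concise response B

pooled = pooled_preference(majority_share, p_majority, p_minority)
single_reward_gap = bt_reward_gap(pooled)      # one reward fit on pooled data
minority_gap = bt_reward_gap(p_minority)       # reward fit on the minority alone

print("single-reward gap r(A) - r(B):", single_reward_gap)
print("minority-only gap r(A) - r(B):", minority_gap)
```

Under these assumptions the single reward fit on pooled data ranks A above B, whereas a reward fit on the minority alone ranks B above A, so a policy optimized against the pooled reward moves away from what the minority prefers.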