Pareto-Optimal Learning from Preferences with Hidden Context
Ryan Bahlous-Boldi, Li Ding, Lee Spector, Scott Niekum
TL;DR
This work addresses the problem of aligning AI systems to diverse human values when preferences originate from multiple hidden-context groups, which can undermine single-point reward learning. It introduces Pareto Optimal Preference Learning (POPL), a framework that uses lexicase selection to generate a diverse set of Pareto-optimal reward functions or policies that cater to distinct hidden-context groups without requiring group labels. The authors prove that optimal policies for hidden-context groups are Pareto-optimal with respect to the full preference set and demonstrate POPL’s effectiveness across stateless reward inference, Minigrid, Metaworld, and large-language-model fine-tuning, outperforming strong baselines in catering to diverse values. POPL offers a principled route to pluralistic alignment and fairness, scalable to high-dimensional sequential domains, while noting limitations in gradient-based optimization and stressing reproducibility through released code. The work thus advances safe and equitable AI by enabling robust, group-aware preference learning without explicit group identification.
Abstract
Ensuring AI models align with human values is essential for their safety and functionality. Reinforcement learning from human feedback (RLHF) leverages human preferences to achieve this alignment. However, when preferences are sourced from diverse populations, point estimates of reward can result in suboptimal performance or be unfair to specific groups. We propose Pareto Optimal Preference Learning (POPL), which enables pluralistic alignment by framing discrepant group preferences as objectives with potential trade-offs, aiming for policies that are Pareto-optimal on the preference dataset. POPL utilizes lexicase selection, an iterative process that selects diverse and Pareto-optimal solutions. Our theoretical and empirical evaluations demonstrate that POPL surpasses baseline methods in learning sets of reward functions and policies, effectively catering to distinct groups without access to the number of groups or to group membership labels. We verify the performance of POPL on a stateless preference learning setting, a Minigrid RL domain, Metaworld robotics benchmarks, and large language model (LLM) fine-tuning. We illustrate that POPL can also serve as a foundation for techniques optimizing specific notions of group fairness, ensuring safe and equitable AI model alignment.
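To make the selection step concrete, the following is a minimal sketch of generic lexicase selection applied to a toy preference dataset. It is not the authors' implementation. The candidate reward weights, the item values, the preference pairs, and the 0/1 error per preference are all hypothetical choices made for illustration. Each preference pair is treated as one test case, and a candidate survives a case only if it matches the best error among the remaining candidates.

```python
import random

def lexicase_select(errors, rng):
    """Return the index of one candidate chosen by lexicase selection.

    errors[i][j] is the error of candidate i on case j (lower is better).
    """
    pool = list(range(len(errors)))
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)  # a fresh random case ordering for every selection event
    for case in cases:
        best = min(errors[i][case] for i in pool)
        # Keep only the candidates that are best on this case.
        pool = [i for i in pool if errors[i][case] == best]
        if len(pool) == 1:
            break
    return rng.choice(pool)  # break any remaining ties at random

# Hypothetical toy setup: items are scalars and a candidate reward is r(x) = w * x.
weights = [1.0, -1.0, 0.0]

# Each pair is (preferred, rejected). The first two pairs come from a hidden
# group that likes large x, the last two from a group that likes small x.
# The group identity is never given to the selection procedure.
preferences = [(2, 1), (3, 0), (0, 2), (1, 3)]

def preference_error(w, pair):
    """0 if the reward ranks the preferred item strictly higher, else 1."""
    preferred, rejected = pair
    return 0 if w * preferred > w * rejected else 1

errors = [[preference_error(w, p) for p in preferences] for w in weights]

rng = random.Random(0)
selected = [lexicase_select(errors, rng) for _ in range(200)]
counts = {w: selected.count(i) for i, w in enumerate(weights)}
```

In this toy case the candidates with weights 1.0 and -1.0 each satisfy one hidden group perfectly, so both are selected across repeated draws, while the compromise weight 0.0 satisfies no preference and is filtered out by whichever case comes first. A method that minimizes average error over all four preferences would see the three candidates as roughly equivalent, which is the failure mode of point estimates that the abstract describes.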
