Pareto-Optimal Learning from Preferences with Hidden Context

Ryan Bahlous-Boldi, Li Ding, Lee Spector, Scott Niekum

TL;DR

This work addresses the problem of aligning AI systems to diverse human values when preferences originate from multiple hidden-context groups, which can undermine single-point reward learning. It introduces Pareto Optimal Preference Learning (POPL), a framework that uses lexicase selection to generate a diverse set of Pareto-optimal reward functions or policies that cater to distinct hidden-context groups without requiring group labels. The authors prove that optimal policies for hidden-context groups are Pareto-optimal with respect to the full preference set and demonstrate POPL’s effectiveness across stateless reward inference, Minigrid, Metaworld, and large-language-model fine-tuning, outperforming strong baselines in catering to diverse values. POPL offers a principled route to pluralistic alignment and fairness that scales to high-dimensional sequential domains; the authors note limitations in gradient-based optimization and stress reproducibility through released code. The work thus advances safe and equitable AI by enabling robust, group-aware preference learning without explicit group identification.

Abstract

Ensuring AI models align with human values is essential for their safety and functionality. Reinforcement learning from human feedback (RLHF) leverages human preferences to achieve this alignment. However, when preferences are sourced from diverse populations, point estimates of reward can result in suboptimal performance or be unfair to specific groups. We propose Pareto Optimal Preference Learning (POPL), which enables pluralistic alignment by framing discrepant group preferences as objectives with potential trade-offs, aiming for policies that are Pareto-optimal on the preference dataset. POPL utilizes lexicase selection, an iterative process that selects diverse and Pareto-optimal solutions. Our theoretical and empirical evaluations demonstrate that POPL surpasses baseline methods in learning sets of reward functions and policies, effectively catering to distinct groups without access to group numbers or membership labels. We verify the performance of POPL on a stateless preference learning setting, a Minigrid RL domain, Metaworld robotics benchmarks, as well as large language model (LLM) fine-tuning. We illustrate that POPL can also serve as a foundation for techniques optimizing specific notions of group fairness, ensuring safe and equitable AI model alignment.

Paper Structure (37 sections, 2 equations, 7 figures, 1 table, 1 algorithm)

This paper contains 37 sections, 2 equations, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: An outline of the proposed Pareto Optimal Preference Learning (POPL) framework. Given a set of pairwise preferences over trajectory segments from groups with potentially different ground truth reward functions, we infer a set of reward functions or policies that captures each group's ground truth, without group membership labels. To do this, we frame reward inference as multi-objective optimization, where each preference forms a single objective, and find a set of Pareto-optimal reward functions or policies.
  • Figure 2: An example of a situation where using POPL is preferable to using a Marginalized Distributional Preference Learning (MDPL) system. Because these systems marginalize over the hidden context $z$ for each state, MDPLs cannot be sensitive to persistent annotator identity. MDPLs represent the distribution of utility values column-wise: they maintain a distribution of utilities for each state that is decoupled from the distributions for other states. Therefore, to an MDPL, the utility of the trajectory $AB$ for both groups is indistinguishable from that of $BC$. POPL, on the other hand, represents the distribution row-wise, finding a set of utility functions that should include the ground truth for each group. In this case, POPL can represent the fact that $AB$ is an unfair trajectory and $BC$ is fair, whereas MDPLs are unable to make this distinction.
  • Figure 3: Lexicase selection being used to select a single candidate hypothesis. Starting with a random ordering, the pool of reward hypotheses is filtered down based on the preferences in order, until a single individual remains or we run out of preferences. The resulting reward function is added to the next pool, and this process is repeated (with new shuffles) to fill the population.
  • Figure 4: (a) and (b) show the catered reward functions for each of the two hidden context groups $z=0$, $z=1$. From a set of reward functions that is inferred from a diversity of human preferences, we select a single reward function for each unique group with a small number of preferences (2% of the size of the training set). POPL is able to cater for both groups, while B-REx is only able to cater for one of the two groups ($z=1$, red line). For B-REx, the $z=0$ (green) group's catered reward function doesn't capture the fact that any state $a<0.8$ is preferred to any state $a\geq 0.8$.
  • Figure 5: Minigrid experiments. Plots in (c), (d), (e) and (f) show average state occupancy for policies catered for each hidden context group. POPL is able to cater distinct policies for each group, while MultiCPL collapses to a single group's preferences.
  • ...and 2 more figures
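
The selection step described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a stateless setting where each state is a single number, each reward hypothesis is a plain function, and each preference is a `(preferred, rejected)` pair scored with a 0/1 error. The names `preference_error`, `lexicase_select`, `next_pool`, `up`, `down`, and `flat` are hypothetical, and any variation step (for example, mutating the selected hypotheses) is omitted.

```python
import random


def preference_error(reward, preference):
    """Return 0 if the reward hypothesis agrees with the preference, else 1.

    Assumption: a preference is a (preferred, rejected) pair of states and
    agreement means strictly higher reward for the preferred state.
    """
    preferred, rejected = preference
    return 0 if reward(preferred) > reward(rejected) else 1


def lexicase_select(pool, preferences, rng):
    """Select one hypothesis by filtering the pool on shuffled preferences."""
    candidates = list(pool)
    order = list(preferences)
    rng.shuffle(order)  # a fresh random ordering for every selection event
    for pref in order:
        if len(candidates) == 1:
            break  # a single individual remains
        errors = [preference_error(r, pref) for r in candidates]
        best = min(errors)
        # Keep only the hypotheses that do best on this preference.
        candidates = [r for r, e in zip(candidates, errors) if e == best]
    # If we ran out of preferences, break ties at random.
    return rng.choice(candidates)


def next_pool(pool, preferences, size, rng):
    """Fill the next pool by repeating selection with new shuffles."""
    return [lexicase_select(pool, preferences, rng) for _ in range(size)]


# Toy hypotheses over a scalar state a (illustrative only).
def up(a):
    return a


def down(a):
    return -a


def flat(a):
    return 0.0


# Two conflicting preferences, as if from two hidden groups:
# one annotator prefers the larger state, the other the smaller one.
prefs = [(0.9, 0.1), (0.2, 0.8)]

if __name__ == "__main__":
    rng = random.Random(0)
    selected = next_pool([up, down, flat], prefs, size=6, rng=rng)
    print([f.__name__ for f in selected])
```

In this toy setup, `flat` satisfies neither preference and is dominated by both `up` and `down`, so it is filtered out at the first preference in any ordering. Which of `up` or `down` survives depends on the shuffle, which is how repeated selection keeps hypotheses for both conflicting groups in the pool without group labels.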