PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences

Daiwei Chen, Yi Chen, Aniket Rege, Ramya Korlakai Vinayak

TL;DR

PAL introduces a pluralistic alignment framework that models human preferences as a convex combination of $K$ prototypes in a latent space, reframing reward learning via an ideal point and distance-based metric approach. By sharing a mapping $f$ and a small set of prototypes across the population, PAL can personalize at the user level through weights in the simplex $\triangle^K$, enabling few-shot adaptation to unseen users without fine-tuning large foundation models. The method uses lightweight two-layer MLP reward models on top of base representations and demonstrates competitive or superior performance across language and vision tasks on synthetic, semi-synthetic, and real datasets, while also revealing biases in rigid rubric datasets. PAL thus provides a scalable, interpretable path toward capturing diverse human preferences and efficient reward modeling, with potential broad impact and avenues for continual learning to accommodate new user groups.
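
To make the convex-combination idea concrete, here is a minimal NumPy sketch; it is not the authors' implementation. It assumes items have already been mapped into the latent space (the shared mapping $f$ is omitted), it assumes squared Euclidean distance as the metric, and the names `user_ideal_point` and `prefers_left` are hypothetical.

```python
import numpy as np


def user_ideal_point(weights, prototypes):
    """Convex combination of K prototypes: a = sum_k w_k * p_k."""
    weights = np.asarray(weights, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    # The weights must lie on the probability simplex.
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return weights @ prototypes  # shape (d,)


def prefers_left(z_left, z_right, ideal_point):
    """Ideal point model: the item closer to the ideal point is preferred."""
    d_left = np.sum((z_left - ideal_point) ** 2)
    d_right = np.sum((z_right - ideal_point) ** 2)
    return bool(d_left < d_right)


# K = 3 prototypes in a d = 2 latent space (illustrative values only).
prototypes = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
weights = np.array([0.5, 0.5, 0.0])  # this user sits between prototypes 0 and 1
a = user_ideal_point(weights, prototypes)

z_left = np.array([2.0, 1.0])
z_right = np.array([0.0, 4.0])
print(a, prefers_left(z_left, z_right, a))
```

Two users who share the same prototypes but hold different weights get different ideal points, so the same pair of items can be ranked differently for each of them.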

Abstract

Large foundation models pretrained on raw web-scale data are not readily deployable without an additional step of extensive alignment to human preferences. Such alignment is typically done by collecting large amounts of pairwise comparisons from humans ("Do you prefer output A or B?") and learning a reward model or a policy with the Bradley-Terry-Luce (BTL) model as a proxy for a human's underlying implicit preferences. These methods generally assume a universal preference shared by all humans and therefore lack the flexibility to adapt to a plurality of opinions and preferences. In this work, we propose PAL, a framework to model human preference complementary to existing pretraining strategies, which incorporates plurality from the ground up. We propose using the ideal point model as a lens to view alignment using preference comparisons. Together with our novel reformulation and mixture modeling, our framework captures the plurality of population preferences while simultaneously learning a common preference latent space across different preferences, which can few-shot generalize to new, unseen users. Our approach enables us to use the penultimate-layer representation of large foundation models and simple MLP layers to learn reward functions that are on par with existing large state-of-the-art reward models, thereby significantly improving the efficiency of reward modeling. We show that PAL achieves competitive reward model accuracy compared to strong baselines on 1) language models with the Summary dataset; 2) image generative models with the Pick-a-Pic dataset; 3) a new semisynthetic heterogeneous dataset generated using Anthropic Personas. Finally, our experiments also highlight a shortcoming of current preference datasets that are created using rigid rubrics, which wash away heterogeneity, and call for more nuanced data collection approaches.
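
For reference, the standard BTL model assigns each output a scalar reward and turns reward differences into a preference probability. The symbols $r$, $A$, $B$, and $\sigma$ (the logistic sigmoid) are generic notation chosen here and are not taken from the paper:

$$
P(A \succ B) = \frac{\exp(r(A))}{\exp(r(A)) + \exp(r(B))} = \sigma\big(r(A) - r(B)\big)
$$

With a single reward function $r$ shared across all annotators, every user is assigned the same preference probability for a given pair, which is the universality assumption criticized above.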

Paper Structure

This paper contains 31 sections, 7 equations, 15 figures, 6 tables, 2 algorithms.

Figures (15)

  • Figure 1: Illustration of the ${\rm PAL}$ framework for learning from diverse preferences (Section \ref{sec:models}). For any user $i$, the probability of preferring $\mathbf{x}_{l}$ to $\mathbf{x}_{r}$ for the context $\mathbf{x}_c$ is given by a reward model $r_\theta^{(i)}$, which uses a mixture modeling approach to capture diverse user preferences: each user's preference is modeled as a convex combination of $K$ prototypes. A reward function formulated using the ${\rm PAL}$ framework can be used flexibly, e.g., with fixed preference points (Model A) or with preference points that are functions of the context/prompt $\mathbf{x}_{c}$ (Model B).
  • Figure 2: Performance of Model A on the simulation datasets with $d=16$, $K \in \{1,2,3,4,5\}$, $K^* \in \{2,3,4\}$, $N = 50 \cdot K^*$, and the mixture user ideal point setting. For the visualization in Figure 2(a), we set $d=2$, $K=3$, $K^*=3$.
  • Figure 3: An example pairwise comparison query with a prompt from our heterogeneous persona dataset, generated using Anthropic's Personas. A synthetic user assigned a persona with an interest in art answers no, giving a ground truth of $y = -1$, whereas a user assigned a persona with an interest in math answers yes, giving a ground truth of $y = +1$.
  • Figure 4: Seen accuracy (a,b,c) and unseen accuracy (d) evaluated on the heterogeneous persona dataset across the number of prototypical groups used in the model $K$. We vary (a) the number of true prototypes $K^\star$, (b) the number of comparisons per seen user $n_p$, (c) the size of latent dimension $d$, (d) the number of comparisons per unseen user $n_{p, \text{unseen}}$.
  • Figure 5: ${\rm PAL}$ Model B test accuracy on Pick-a-Filter compared to CLIP-H.
  • ...and 10 more figures
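
One way to picture the unseen-user setting evaluated in Figure 4(d) is to keep the learned prototypes fixed and estimate only a new user's simplex weights from that user's comparisons. The sketch below is an illustrative assumption, not the paper's algorithm: it assumes items are already in the latent space, assumes a logistic loss on squared-distance margins, parameterizes the weights with a softmax, and uses plain gradient descent. The function name `fit_user_weights` and all data are hypothetical.

```python
import numpy as np


def _softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()


def fit_user_weights(prototypes, z_left, z_right, y, steps=500, lr=0.5):
    """Estimate one user's simplex weights with the prototypes held fixed.

    prototypes: (K, d); z_left, z_right: (n, d) latent items;
    y: (n,) labels in {+1, -1}, where +1 means the left item was preferred.
    """
    logits = np.zeros(prototypes.shape[0])  # start from uniform weights
    for _ in range(steps):
        w = _softmax(logits)
        a = w @ prototypes  # current estimate of the user's ideal point
        # margin > 0 when the left item is closer to the ideal point
        margin = np.sum((z_right - a) ** 2, axis=1) - np.sum((z_left - a) ** 2, axis=1)
        # derivative of log(1 + exp(-y * margin)) with respect to margin
        g_margin = -y / (1.0 + np.exp(np.clip(y * margin, -50.0, 50.0)))
        # d margin / d a = 2 * (z_left - z_right)
        g_a = (g_margin[:, None] * 2.0 * (z_left - z_right)).mean(axis=0)
        g_w = prototypes @ g_a
        # chain rule through the softmax
        logits -= lr * w * (g_w - w @ g_w)
    return _softmax(logits)


# Synthetic check: a user whose true weights favor prototype 1.
rng = np.random.default_rng(0)
prototypes = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
true_a = np.array([0.1, 0.8, 0.1]) @ prototypes
z_left = rng.uniform(-0.5, 1.5, size=(200, 2))
z_right = rng.uniform(-0.5, 1.5, size=(200, 2))
closer_left = np.sum((z_left - true_a) ** 2, axis=1) < np.sum((z_right - true_a) ** 2, axis=1)
y = np.where(closer_left, 1.0, -1.0)

w_hat = fit_user_weights(prototypes, z_left, z_right, y)
print(np.round(w_hat, 3))
```

Only $K$ numbers are fitted per new user, which is why this kind of adaptation needs no update to the shared mapping or the underlying foundation model.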