Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework

Kihyun Kim, Jiawei Zhang, Asuman Ozdaglar, Pablo A. Parrilo

TL;DR

This work addresses the misalignment risk of traditional preference-learning methods, which often overrepresent majority opinions and are vulnerable to manipulation. It introduces an axiomatic framework that infers the feasible set of evaluator population distributions from pairwise comparisons and enforces four axioms: monotonicity, Pareto efficiency, population-proportional alignment (PPA), and population-bounded manipulability (PBM). The proposed algorithm outputs policies with provable guarantees, and a softmax relaxation trades off population-proportional alignment against selection of the Condorcet winner. Experiments on tabular recommendation and instruction-tuned LLMs show the trade-off between win rate and population-proportional alignment, along with robustness to manipulation and scalability to high-dimensional settings.

Abstract

Conventional preference learning methods often prioritize opinions held more widely when aggregating preferences from multiple evaluators. This may result in policies that are biased in favor of some types of opinions or groups and susceptible to strategic manipulation. To address this issue, we develop a novel preference learning framework capable of aligning aggregate opinions and policies proportionally with the true population distribution of evaluator preferences. Grounded in social choice theory, our approach infers the feasible set of evaluator population distributions directly from pairwise comparison data. Using these estimates, the algorithm constructs a policy that satisfies foundational axioms from social choice theory, namely monotonicity and Pareto efficiency, as well as our newly introduced axioms of population-proportional alignment and population-bounded manipulability. Moreover, we propose a softmax relaxation method that smoothly trades off population-proportional alignment against the selection of the Condorcet winner (the option that beats every other option in pairwise comparisons). Finally, we validate the effectiveness and scalability of our approach through experiments on both tabular recommendation tasks and large language model alignment.
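The Condorcet winner mentioned above can be made concrete with a small sketch. It assumes a pairwise preference matrix `P` in which `P[i][j]` is the fraction of evaluators who prefer option `i` to option `j`; the function name `condorcet_winner`, the matrix layout, and the example numbers are illustrative assumptions, not taken from the paper.

```python
from typing import Optional, Sequence


def condorcet_winner(P: Sequence[Sequence[float]]) -> Optional[int]:
    """Return the index of the option that beats every other option
    in pairwise comparison, or None if no such option exists.

    Assumption: P[i][j] is the fraction of evaluators preferring i to j,
    so option i beats option j when P[i][j] > 0.5.
    """
    n = len(P)
    for i in range(n):
        if all(P[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None


# Hypothetical profile where option 0 beats both other options.
P_transitive = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.55],
    [0.3, 0.45, 0.5],
]
winner = condorcet_winner(P_transitive)

# Hypothetical cyclic profile (0 beats 1, 1 beats 2, 2 beats 0):
# no option beats all others, so there is no Condorcet winner.
P_cyclic = [
    [0.5, 0.6, 0.4],
    [0.4, 0.5, 0.6],
    [0.6, 0.4, 0.5],
]
no_winner = condorcet_winner(P_cyclic)
```

The cyclic case shows why a Condorcet winner is not guaranteed to exist, which is one reason a method may need a fallback rather than relying on it alone.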

Paper Structure

This paper contains 67 sections, 13 theorems, 60 equations, 2 figures, 6 tables.

Key Result

Proposition 3.5

$\Phi^\mathrm{MB}$ and $\Phi^\mathrm{ML}$ violate the $\alpha$-PPA axiom for any $\alpha$ and the $\gamma$-PBM axiom for any $\gamma$. $\Phi^\mathrm{RD}$ satisfies all four axioms.
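As a rough illustration of why random dictatorship aligns with population shares, the sketch below uses the textbook social-choice definition: one evaluator is drawn at random and their top-ranked option is selected, so each option's selection probability equals the population share of evaluators who rank it first. The paper's Definition 3.4 is not reproduced in this summary and may differ in detail; the function name `random_dictatorship`, the input format, and the example groups are assumptions made for illustration.

```python
from typing import Dict, Sequence


def random_dictatorship(
    rankings: Sequence[Sequence[str]],
    weights: Sequence[float],
) -> Dict[str, float]:
    """Selection probabilities under textbook random dictatorship.

    rankings[k] is group k's ranking, best option first.
    weights[k] is group k's (unnormalized) population share.
    Each option's probability is the total share of groups ranking it first.
    """
    totals: Dict[str, float] = {}
    for ranking, w in zip(rankings, weights):
        top = ranking[0]
        totals[top] = totals.get(top, 0.0) + w
    z = sum(weights)
    return {option: share / z for option, share in totals.items()}


# Hypothetical population: 60% rank a first, 30% rank b first, 10% rank c first.
rankings = [
    ["a", "b", "c"],
    ["b", "c", "a"],
    ["c", "b", "a"],
]
weights = [0.6, 0.3, 0.1]
policy = random_dictatorship(rankings, weights)
```

In this example the resulting policy mirrors the group shares, whereas a rule that always selects the majority favorite would put all probability on `a`.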

Figures (2)

  • Figure 1: Illustration of the relationships between the profile, preference function, and policy.
  • Figure 2: Tabular experiment results (Section 5.1) for $F^\beta$, $F^\mathrm{RL}$, and $F^\mathrm{NL}$. Left: win rate (left axis, blue) and PPA level (right axis, orange). Right: PBM level (policy gain through manipulation).

Theorems & Definitions (29)

  • Definition 3.1: $\alpha$-Population-proportional alignment ($\alpha$-PPA)
  • Definition 3.2: Single-group manipulated profile
  • Definition 3.3: $\gamma$-Population-bounded manipulability ($\gamma$-PBM)
  • Definition 3.4: Random dictatorship
  • Proposition 3.5
  • Definition 4.1
  • Proposition 4.2
  • Definition 4.3
  • Theorem 4.4
  • Definition 4.5
  • ...and 19 more