Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework
Kihyun Kim, Jiawei Zhang, Asuman Ozdaglar, Pablo A. Parrilo
TL;DR
This work addresses the misalignment risk of traditional preference-learning methods, which often overrepresent majority opinions and are vulnerable to manipulation. It introduces an axiomatic framework that infers the feasible set of evaluator population distributions from pairwise comparisons and enforces four axioms: monotonicity, Pareto efficiency, population-proportional alignment (PPA), and population-bounded manipulability (PBM). The proposed algorithm outputs policies with provable guarantees, and a softmax relaxation balances PPA against Condorcet-consistent outcomes. Empirical results on tabular recommendation and instruction-tuned LLMs demonstrate meaningful trade-offs between win rate and PPA, along with robustness to manipulation and scalability to high-dimensional settings.
Abstract
Conventional preference learning methods often prioritize opinions held more widely when aggregating preferences from multiple evaluators. This may result in policies that are biased in favor of some types of opinions or groups and susceptible to strategic manipulation. To address this issue, we develop a novel preference learning framework capable of aligning aggregate opinions and policies proportionally with the true population distribution of evaluator preferences. Grounded in social choice theory, our approach infers the feasible set of evaluator population distributions directly from pairwise comparison data. Using these estimates, the algorithm constructs a policy that satisfies foundational axioms from social choice theory, namely monotonicity and Pareto efficiency, as well as our newly introduced axioms of population-proportional alignment and population-bounded manipulability. Moreover, we propose a soft-max relaxation method that smoothly trades off population-proportional alignment against the selection of the Condorcet winner (the option that beats all other options in pairwise comparisons). Finally, we validate the effectiveness and scalability of our approach through experiments on both tabular recommendation tasks and large language model alignment.
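To make the Condorcet winner concrete, the following minimal sketch computes pairwise preference shares from a toy population and checks which option, if any, beats every other option. It is an illustration only and not the paper's algorithm: the helper names `pairwise_matrix` and `condorcet_winner`, the group weights, and the rankings are all hypothetical.

```python
def pairwise_matrix(groups, options):
    """Return P where P[a][b] is the population share that ranks a above b.

    groups: list of (weight, ranking) pairs; each ranking lists options from
    best to worst, and the weights are assumed to sum to 1.
    """
    P = {a: {b: 0.0 for b in options if b != a} for a in options}
    for weight, ranking in groups:
        position = {opt: i for i, opt in enumerate(ranking)}
        for a in options:
            for b in options:
                if a != b and position[a] < position[b]:
                    P[a][b] += weight
    return P


def condorcet_winner(P):
    """Return the option preferred by a strict majority against every other
    option, or None if no such option exists (e.g. a preference cycle)."""
    for a, row in P.items():
        if all(share > 0.5 for share in row.values()):
            return a
    return None


options = ["a", "b", "c"]

# Hypothetical population: a 60% majority and a 40% minority with
# opposite rankings.
polarized = [(0.6, ["a", "b", "c"]), (0.4, ["c", "b", "a"])]
P_polarized = pairwise_matrix(polarized, options)
winner_polarized = condorcet_winner(P_polarized)

# Hypothetical cyclic population: three equal groups, no Condorcet winner.
cyclic = [
    (1 / 3, ["a", "b", "c"]),
    (1 / 3, ["b", "c", "a"]),
    (1 / 3, ["c", "a", "b"]),
]
winner_cyclic = condorcet_winner(pairwise_matrix(cyclic, options))

print(winner_polarized, winner_cyclic)
```

In the polarized example the majority's top choice is the Condorcet winner even though 40% of evaluators rank it last, which is the kind of majority dominance that a population-proportional policy is meant to temper.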
