Jackpot! Alignment as a Maximal Lottery

Roberto-Rafael Maura-Rivero, Marc Lanctot, Francesco Visin, Kate Larson

TL;DR

This work reframes AI alignment as a probabilistic social choice problem and shows that standard RLHF, which approximates a Borda-like aggregation, can fail to satisfy major democratic properties such as majority rule. It introduces Maximal Lotteries (ML) as a principled alignment rule that is Condorcet- and majority-consistent and robust to irrelevant alternatives, and connects its objective to Nash Learning from Human Feedback (NLHF) through an indifference term that accounts for non-strict preferences. The authors formalize a training objective whose solution is a Maximal Lottery, relate NLHF to it, and validate the approach with synthetic experiments covering majority, independence of irrelevant alternatives (IIA), and cyclic-preference scenarios, in which the Maximal Lottery approach outperforms RLHF. The results suggest that Social Choice Theory offers a promising route to more robust alignment with human values; extensions to online, context-aware, and demographic-aware settings are left to future work.

Abstract

Reinforcement Learning from Human Feedback (RLHF), the standard for aligning Large Language Models (LLMs) with human values, is known to fail to satisfy properties that are intuitively desirable, such as respecting the preferences of the majority \cite{ge2024axioms}. To overcome these issues, we propose the use of a probabilistic Social Choice rule called \emph{maximal lotteries} as a replacement for RLHF. We show that a family of alignment techniques, namely Nash Learning from Human Feedback (NLHF) \cite{munos2023nash} and variants, approximate maximal lottery outcomes and thus inherit its beneficial properties. We confirm experimentally that our proposed methodology handles situations that arise when working with preferences more robustly than standard RLHF, including supporting the preferences of the majority, providing principled ways of handling non-transitivities in the preference data, and robustness to irrelevant alternatives. This results in systems that better incorporate human values and respect human intentions.

Paper Structure

This paper contains 42 sections, 4 theorems, 35 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Let $\mathcal{Y}$ be the set of all possible statements up to a finite maximum length $L$. Let $\pi$ and $\pi^{\prime}$ represent two policies (i.e., LLMs). For two statements $a, b \in \mathcal{Y}$, let $P(a~\succ~b)$ be the probability that a random individual picked uniformly from society prefers $a$ to $b$. [...] is the Maximal Lottery for the Social Choice problem defined by the set of alternatives $\mathcal{Y}$ [...]
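
For orientation, the textbook definition of a maximal lottery can be written in the notation of the theorem. This is the standard Social Choice formulation, not necessarily the exact objective stated in the paper; the symbols $M$, $p$, and $q$ are introduced here for illustration only.

$$
M_{ab} = P(a \succ b) - P(b \succ a), \qquad
p \in \Delta(\mathcal{Y}) \text{ is a maximal lottery} \iff \sum_{a, b \in \mathcal{Y}} p_a \, M_{ab} \, q_b \ge 0 \quad \text{for all } q \in \Delta(\mathcal{Y}).
$$

Because $M$ is skew-symmetric, this says that $p$ is an optimal strategy in the symmetric zero-sum game with payoff matrix $M$: no competing lottery $q$ is preferred to $p$ by an expected majority. Since the left-hand side is linear in $q$, it is enough to check the condition for each pure alternative $b$.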

Figures (2)

  • Figure 1: Although B is the option preferred by the majority, LLMs aligned with RLHF fail to capture that, returning R. Thus, RLHF violates major democratic properties such as majority rule, while methods that emulate Maximal Lotteries satisfy them.
  • Figure 2: Simulation results. Left column: populations simulate the preferences from \ref{fig:example-image} and \ref{tab:IIA_3}, $(2\times(\textcolor{red}{R} \succ \textcolor{green}{G} \succ \textcolor{blue}{B}), 3\times(\textcolor{blue}{B} \succ \textcolor{red}{R} \succ \textcolor{green}{G}))$. Middle column: populations mirror the preferences from \ref{tab:IIA_2}, $(2\times(\textcolor{red}{R} \succ \textcolor{blue}{B}), 3\times(\textcolor{blue}{B} \succ \textcolor{red}{R}))$. Right column: populations exhibit the cyclical preferences from \ref{tab:rock_paper_scissors}, $(1\times(\textcolor{red}{R} \succ \textcolor{green}{G} \succ \textcolor{blue}{B}), 1\times(\textcolor{green}{G} \succ \textcolor{blue}{B} \succ \textcolor{red}{R}), 1\times(\textcolor{blue}{B} \succ \textcolor{red}{R} \succ \textcolor{green}{G}))$.
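
The three preference profiles in the Figure 2 caption are small enough to check by hand. The sketch below is not the paper's training procedure; it computes plain Borda scores (standing in for the Borda-like aggregation that RLHF is said to approximate) and checks the standard maximal-lottery condition on the pairwise majority margins. All function and variable names are hypothetical.

```python
from fractions import Fraction

def borda_scores(profile, alternatives):
    """profile: list of (voter_count, ranking), each ranking listed best-first."""
    scores = {a: 0 for a in alternatives}
    for count, ranking in profile:
        m = len(ranking)
        for pos, a in enumerate(ranking):
            scores[a] += count * (m - 1 - pos)  # top choice earns m-1 points
    return scores

def margin_matrix(profile, alternatives):
    """M[a][b] = (#voters preferring a to b) - (#voters preferring b to a)."""
    M = {a: {b: 0 for b in alternatives} for a in alternatives}
    for count, ranking in profile:
        for i, a in enumerate(ranking):
            for b in ranking[i + 1:]:
                M[a][b] += count
                M[b][a] -= count
    return M

def is_maximal_lottery(p, M):
    """True if lottery p weakly beats every pure alternative b in expectation.

    Checking pure alternatives suffices because the expected margin
    is linear in the opposing lottery.
    """
    return all(sum(p[a] * M[a][b] for a in M) >= 0 for b in M)

# Profiles taken from the Figure 2 caption.
left = [(2, ["R", "G", "B"]), (3, ["B", "R", "G"])]
middle = [(2, ["R", "B"]), (3, ["B", "R"])]
cyclic = [(1, ["R", "G", "B"]), (1, ["G", "B", "R"]), (1, ["B", "R", "G"])]

borda_left = borda_scores(left, ["R", "G", "B"])
borda_middle = borda_scores(middle, ["R", "B"])
M_left = margin_matrix(left, ["R", "G", "B"])
M_cyclic = margin_matrix(cyclic, ["R", "G", "B"])

all_B = {"R": 0, "G": 0, "B": 1}
all_R = {"R": 1, "G": 0, "B": 0}
uniform = {a: Fraction(1, 3) for a in ["R", "G", "B"]}

print("Borda, left profile:", borda_left)
print("Borda, middle profile:", borda_middle)
print("B is maximal (left):", is_maximal_lottery(all_B, M_left))
print("R is maximal (left):", is_maximal_lottery(all_R, M_left))
print("uniform is maximal (cyclic):", is_maximal_lottery(uniform, M_cyclic))
```

In the left profile Borda ranks R first even though three of five voters put B on top, and dropping G (middle profile) flips the Borda winner to B, which illustrates the sensitivity to irrelevant alternatives. The point mass on B passes the maximal-lottery check in the left profile, and in the cyclic profile the uniform lottery passes while a point mass on R does not.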

Theorems & Definitions (10)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Theorem 1
  • Corollary 1.1
  • Theorem
  • proof
  • Theorem 2: BTL Identifies Borda Count
  • proof