Statistical Impossibility and Possibility of Aligning LLMs with Human Preferences: From Condorcet Paradox to Nash Equilibrium

Kaizhao Liu, Qi Long, Zhekun Shi, Weijie J. Su, Jiancong Xiao

TL;DR

This work establishes fundamental statistical limits on aligning LLMs with human preferences. It shows that a reward model cannot represent human preferences whenever a Condorcet cycle is present, and that such cycles occur with high probability under a probabilistic preference model. It then shows that a non-reward-based approach, NLHF, preserves minority preferences by producing a mixed Nash equilibrium whenever there is no Condorcet winner, a condition that holds with high probability as the number of responses grows. The authors introduce Nash Rejection Sampling (Nash-RS), a single-loop algorithm for computing NLHF equilibria; Llama-3.2-1B aligned with Nash-RS achieves a 60.55% win rate against the base model. The results support NLHF as a statistically favorable alternative to RLHF for preserving diverse human preferences, with implications for policy design, evaluation, and future preference-optimization methods.

Abstract

Aligning large language models (LLMs) with diverse human preferences is critical for ensuring fairness and informed outcomes when deploying these models for decision-making. In this paper, we seek to uncover fundamental statistical limits concerning aligning LLMs with human preferences, with a focus on the probabilistic representation of human preferences and the preservation of diverse preferences in aligned LLMs. We first show that human preferences can be represented by a reward model if and only if the preference among LLM-generated responses is free of any Condorcet cycle. Moreover, we prove that Condorcet cycles exist with probability converging to one exponentially fast under a probabilistic preference model, thereby demonstrating the impossibility of fully aligning human preferences using reward-based approaches such as reinforcement learning from human feedback. Next, we explore the conditions under which LLMs would employ mixed strategies -- meaning they do not collapse to a single response -- when aligned in the limit using a non-reward-based approach, such as Nash learning from human feedback (NLHF). We identify a necessary and sufficient condition for mixed strategies: the absence of a response that is preferred over all others by a majority. As a blessing, we prove that this condition holds with high probability under the probabilistic preference model, thereby highlighting the statistical possibility of preserving minority preferences without explicit regularization in aligning LLMs. Finally, we leverage insights from our statistical results to design a novel, computationally efficient algorithm for finding Nash equilibria in aligning LLMs with NLHF. Our experiments show that Llama-3.2-1B, aligned with our algorithm, achieves a win rate of 60.55% against the base model.
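The mixed-strategy condition stated in the abstract can be checked numerically on small examples. The sketch below is an illustration only and is not the paper's Nash-RS algorithm. It assumes a finite response set with a pairwise preference matrix `P` where `P[i][j]` is the probability that response `i` is preferred to response `j` and `P[i][j] + P[j][i] = 1`, so the preference game is symmetric with value 1/2. The function names `condorcet_winner` and `is_nash_policy` and both example matrices are hypothetical.

```python
def condorcet_winner(P):
    """Return the index of a response preferred to every other response
    by a majority (P[i][j] > 1/2 for all j != i), or None if none exists."""
    n = len(P)
    for i in range(n):
        if all(P[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None


def is_nash_policy(pi, P, tol=1e-9):
    """Check whether the distribution pi over responses is a maximin policy.

    Assumption: the game is symmetric with value 1/2, so pi is a Nash
    solution exactly when it wins with probability >= 1/2 against every
    pure response j.
    """
    n = len(P)
    return all(
        sum(pi[i] * P[i][j] for i in range(n)) >= 0.5 - tol
        for j in range(n)
    )


# Hypothetical cyclic preference: 0 beats 1, 1 beats 2, 2 beats 0.
P_cycle = [
    [0.5, 2 / 3, 1 / 3],
    [1 / 3, 0.5, 2 / 3],
    [2 / 3, 1 / 3, 0.5],
]

# Hypothetical preference in which response 0 beats both others.
P_winner = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
]

uniform = [1 / 3, 1 / 3, 1 / 3]
pure_0 = [1.0, 0.0, 0.0]

# No Condorcet winner: the uniform mixed policy is an equilibrium,
# while collapsing onto response 0 is not.
cycle_summary = (
    condorcet_winner(P_cycle),
    is_nash_policy(uniform, P_cycle),
    is_nash_policy(pure_0, P_cycle),
)

# Condorcet winner present: the pure policy on response 0 is an equilibrium,
# while the uniform policy is not.
winner_summary = (
    condorcet_winner(P_winner),
    is_nash_policy(pure_0, P_winner),
    is_nash_policy(uniform, P_winner),
)
```

In the cyclic example the equilibrium keeps positive mass on all three responses, which is the sense in which minority preferences survive; in the second example the equilibrium collapses onto the majority-preferred response.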

Paper Structure

This paper contains 71 sections, 35 theorems, 159 equations, 5 figures, 7 tables, 3 algorithms.

Key Result

Theorem 2.1

For any set of responses $\{y_1,\ldots,y_n\}$ with $n\geqslant 3$ and any preference $\mathcal{P}(y\succ y')$ defined on this set, there exists a reward model $r(y)$ that captures the preference $\mathcal{P}(y\succ y')$ if and only if there is no Condorcet cycle in the set of responses.
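Theorem 2.1 can be illustrated on the classical three-voter Condorcet paradox. The following is a minimal sketch under stated assumptions: a reward "captures" a preference when $r(y) > r(y')$ exactly when $\mathcal{P}(y\succ y') > 1/2$, a Condorcet cycle is a directed cycle of majority preferences, and no pair is tied at exactly $1/2$. The helper names and the voter rankings are hypothetical, and the win-count reward is one convenient choice rather than the paper's construction.

```python
def pairwise_preference(rankings):
    """P[a][b] = fraction of rankings that place response a above b."""
    items = sorted(rankings[0])
    P = {a: {b: 0.0 for b in items if b != a} for a in items}
    for ranking in rankings:
        position = {y: k for k, y in enumerate(ranking)}
        for a in items:
            for b in items:
                if a != b and position[a] < position[b]:
                    P[a][b] += 1
    for a in items:
        for b in P[a]:
            P[a][b] /= len(rankings)
    return P


def find_condorcet_cycle(P):
    """Depth-first search on the majority graph (edge a -> b when
    P[a][b] > 1/2). Returns the responses on one cycle, or None."""
    beats = {a: [b for b in P[a] if P[a][b] > 0.5] for a in P}
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {a: UNSEEN for a in P}
    path = []

    def visit(a):
        state[a] = ACTIVE
        path.append(a)
        for b in beats[a]:
            if state[b] == ACTIVE:          # back edge closes a cycle
                return path[path.index(b):]
            if state[b] == UNSEEN:
                cycle = visit(b)
                if cycle:
                    return cycle
        state[a] = DONE
        path.pop()
        return None

    for a in P:
        if state[a] == UNSEEN:
            cycle = visit(a)
            if cycle:
                return cycle
    return None


def reward_from_preference(P):
    """If there is no cycle (and no ties), the number of responses each
    response beats is a reward consistent with every majority preference."""
    if find_condorcet_cycle(P) is not None:
        return None
    return {a: sum(1 for b in P[a] if P[a][b] > 0.5) for a in P}


# Three voters with rotated rankings: the Condorcet paradox.
paradox = pairwise_preference([("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")])
paradox_cycle = find_condorcet_cycle(paradox)
paradox_reward = reward_from_preference(paradox)

# Three voters whose majority preference is transitive: A > B > C.
transitive = pairwise_preference([("A", "B", "C"), ("A", "B", "C"), ("B", "A", "C")])
transitive_reward = reward_from_preference(transitive)
```

In the paradox case each of A over B, B over C, and C over A is held by two of three voters, so any reward would need $r(A) > r(B) > r(C) > r(A)$, which is impossible. In the transitive case the win counts give a valid reward.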

Figures (5)

  • Figure 1: Illustration of the statistical impossibility and possibility of aligning LLMs with human preferences. Reward models cannot capture human preferences that contain Condorcet cycles, while non-reward-based NLHF can preserve minority preferences when there is no Condorcet winning response. Green text indicates properties of NLHF, and blue text indicates properties of RLHF. Under the BTL model, RLHF- and NLHF-aligned LLMs share the same property, so black text is used. The abbreviation "w.h.p." stands for "with high probability." In general, when it collapses, an RLHF-aligned LLM does not necessarily collapse to the Condorcet winning response (see the discussion in Example \ref{exam:rlhf_solution}). "Other" reward models are discussed in Section \ref{sec:benefit}.
  • Figure 2: Demonstration for $n=3$: the red arrows show the Hamiltonian path in the directed graph.
  • Figure 3: Demonstration for induction in case $n+1$: the red arrows show the Hamiltonian path in the graph.
  • Figure 4: Demonstration for induction in case $n+1$: the red arrows show the Hamiltonian path in the graph.
  • Figure 5: Comparison between our implicit reward and the reward of RLHF when there are two responses for different parameters.

Theorems & Definitions (75)

  • Definition 2.3: Reward-Consistent Preference
  • Example 2.4: Condorcet paradox [gehrlein2006condorcet]
  • Definition 2.5: Condorcet Cycle
  • Theorem 2.1: Necessary and Sufficient Conditions for Reward Modeling
  • Remark 2.6
  • Proposition 2.7
  • Proposition 2.8
  • Theorem 2.2
  • Example 3.2
  • Example 3.3
  • ...and 65 more