Statistical Impossibility and Possibility of Aligning LLMs with Human Preferences: From Condorcet Paradox to Nash Equilibrium
Kaizhao Liu, Qi Long, Zhekun Shi, Weijie J. Su, Jiancong Xiao
TL;DR
This work establishes fundamental statistical limits on aligning LLMs with human preferences. It shows that reward-based methods such as RLHF cannot represent human preferences whenever a Condorcet cycle arises, and that such cycles occur with high probability under a probabilistic preference model. It then shows that a non-reward-based approach, Nash learning from human feedback (NLHF), can preserve minority preferences: it yields a mixed Nash equilibrium whenever there is no Condorcet winner, a condition that holds with high probability as the number of responses grows. The authors introduce Nash Rejection Sampling (Nash-RS), a single-loop algorithm for computing NLHF equilibria, and report that Llama-3.2-1B aligned with it achieves a 60.55% win rate against the base model. The results support NLHF as a statistically favorable alternative to RLHF for preserving diverse human preferences, with implications for policy design, evaluation, and future preference-optimization methods for large language models.
Abstract
Aligning large language models (LLMs) with diverse human preferences is critical for ensuring fairness and informed outcomes when deploying these models for decision-making. In this paper, we seek to uncover fundamental statistical limits on aligning LLMs with human preferences, with a focus on the probabilistic representation of human preferences and the preservation of diverse preferences in aligned LLMs. We first show that human preferences can be represented by a reward model if and only if the preference among LLM-generated responses is free of any Condorcet cycle. Moreover, we prove that Condorcet cycles exist with probability converging to one exponentially fast under a probabilistic preference model, thereby demonstrating the impossibility of fully aligning human preferences using reward-based approaches such as reinforcement learning from human feedback (RLHF). Next, we explore the conditions under which LLMs would employ mixed strategies (meaning they do not collapse to a single response) when aligned in the limit using a non-reward-based approach, such as Nash learning from human feedback (NLHF). We identify a necessary and sufficient condition for mixed strategies: the absence of a response that is preferred over all others by a majority. Fortunately, we prove that this condition holds with high probability under the probabilistic preference model, thereby highlighting the statistical possibility of preserving minority preferences without explicit regularization in aligning LLMs. Finally, we leverage insights from our statistical results to design a novel, computationally efficient algorithm for finding Nash equilibria in aligning LLMs with NLHF. Our experiments show that Llama-3.2-1B, aligned with our algorithm, achieves a win rate of 60.55% against the base model.
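
To make the two central conditions concrete, the following is a minimal illustrative sketch and not the authors' code. It uses a hypothetical toy population of three annotators and three responses, and it assumes that "represented by a reward model" means a strict reward ordering agrees with the majority preference on every pair; the paper's formal definition may differ. The payoff P(x beats y) - 1/2 for the preference game is likewise an assumption made for illustration, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical toy data: three annotators, each ranking responses best to worst.
RANKINGS = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]
RESPONSES = ("A", "B", "C")
HALF = Fraction(1, 2)


def pref_prob(x, y, rankings=RANKINGS):
    """Fraction of annotators who rank x above y."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return Fraction(wins, len(rankings))


def condorcet_winner(responses=RESPONSES, rankings=RANKINGS):
    """Return the response that beats every other by a majority, or None."""
    for x in responses:
        if all(pref_prob(x, y, rankings) > HALF for y in responses if y != x):
            return x
    return None


def reward_consistent_orders(responses=RESPONSES, rankings=RANKINGS):
    """Strict orderings (stand-ins for reward rankings) that agree with
    the majority preference on every pair."""
    consistent = []
    for order in permutations(responses):
        # order[i] is assumed to have higher reward than order[j] when i < j.
        if all(pref_prob(order[i], order[j], rankings) > HALF
               for i in range(len(order))
               for j in range(i + 1, len(order))):
            consistent.append(order)
    return consistent


def is_symmetric_nash(policy, responses=RESPONSES, rankings=RANKINGS):
    """Check that no pure deviation gains against `policy`.

    With payoff P(x beats y) - 1/2 the game is symmetric and zero-sum,
    so any policy earns exactly 0 against itself.
    """
    for x in responses:
        gain = sum(policy[y] * (pref_prob(x, y, rankings) - HALF)
                   for y in responses if y != x)
        if gain > 0:
            return False
    return True


uniform = {r: Fraction(1, 3) for r in RESPONSES}
always_a = {"A": Fraction(1), "B": Fraction(0), "C": Fraction(0)}

print(condorcet_winner())           # no majority winner in the cycle
print(reward_consistent_orders())   # no reward ordering fits the cycle
print(is_symmetric_nash(uniform))   # uniform mixing is an equilibrium
print(is_symmetric_nash(always_a))  # collapsing to one response is not
```

In this toy cycle (A beats B, B beats C, and C beats A, each by a 2/3 majority) no ordering of rewards reproduces the majority preferences, while the uniform mixed policy survives every pure deviation, which mirrors the abstract's claims at the smallest possible scale.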
