Robust AI Evaluation through Maximal Lotteries

Hadi Khalaf, Serena L. Wang, Daniel Halpern, Itai Shapira, Flavio du Pin Calmon, Ariel D. Procaccia

TL;DR

The paper shows that maximal lotteries are highly sensitive to preference heterogeneity and can favor models that severely underperform on specific tasks or user subpopulations. It introduces robust lotteries, which optimize worst-case performance under plausible shifts in the preference data.

Abstract

The standard way to evaluate language models on subjective tasks is through pairwise comparisons: an annotator chooses the "better" of two responses to a prompt. Leaderboards aggregate these comparisons into a single Bradley-Terry (BT) ranking, forcing heterogeneous preferences into a total order and violating basic social-choice desiderata. In contrast, social choice theory provides an alternative approach called maximal lotteries, which aggregates pairwise preferences without imposing any assumptions on their structure. However, we show that maximal lotteries are highly sensitive to preference heterogeneity and can favor models that severely underperform on specific tasks or user subpopulations. We introduce robust lotteries that optimize worst-case performance under plausible shifts in the preference data. On large-scale preference datasets, robust lotteries provide more reliable win rate guarantees across the annotator distribution and recover a stable set of top-performing models. By moving from rankings to pluralistic sets of winners, robust lotteries offer a principled step toward an ecosystem of complementary AI systems that serve the full spectrum of human preferences.

Paper Structure

This paper contains 22 sections, 12 theorems, 126 equations, 21 figures, and 9 tables.

Key Result

Theorem 8

If $i^\star$ is an RCW, then the point mass $e_{i^\star}$ is a robust lottery. If $i^\star$ is a strict RCW, then $e_{i^\star}$ is the unique robust lottery.
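
The reasoning behind a result of this form can be sketched in a few lines. The sketch assumes definitions that this summary does not reproduce, so they may differ from the paper's exact statements: an ambiguity set $\mathcal{U}$ of skew-symmetric margin matrices, a robust lottery as any maximizer of $V(p)$ below, and an RCW as a model $i^\star$ with $M_{i^\star j} \ge 0$ for every $j$ and every $M \in \mathcal{U}$ (strictly positive for $j \ne i^\star$ in the strict case).

```latex
% Assumed objective (hypothetical notation): worst case over matrices and pure opponents.
V(p) \;=\; \min_{M \in \mathcal{U}} \; \min_{j} \; p^\top M e_j .

% Step 1: no lottery can guarantee more than 0.
% Skew-symmetry gives p^\top M p = 0, and a minimum is at most any weighted average:
\min_j \, p^\top M e_j \;\le\; \sum_j p_j \, p^\top M e_j \;=\; p^\top M p \;=\; 0
\quad\Longrightarrow\quad V(p) \le 0 .

% Step 2: the point mass on an RCW attains 0.
e_{i^\star}^\top M e_j \;=\; M_{i^\star j} \;\ge\; 0 \ \ \text{for all } j,\ M \in \mathcal{U},
\qquad M_{i^\star i^\star} = 0
\quad\Longrightarrow\quad V(e_{i^\star}) = 0 .

% Step 3: uniqueness under a strict RCW. For p \ne e_{i^\star} and any M \in \mathcal{U},
p^\top M e_{i^\star} \;=\; \sum_{i} p_i M_{i i^\star}
\;=\; -\sum_{i \ne i^\star} p_i M_{i^\star i} \;<\; 0
\quad\Longrightarrow\quad V(p) < 0 = V(e_{i^\star}) .
```

Under these assumptions, Steps 1 and 2 give the first claim and Step 3 gives the second.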

Figures (21)

  • Figure 1: Robust lotteries improve worst-case win rate across user subpopulations. We consider a fixed set of models (red points) and partition the LMArena votes according to the language of the prompt. We then evaluate any lottery (a probability distribution over models) by its worst-case performance on LMArena ($y$-axis): the minimum win rate the lottery achieves across the four user subpopulations. Because the guarantee is $\min_q \Pr(p \succ q)$ in a symmetric zero-sum game, it is at most 50%. The $x$-axis reports the lottery’s expected inference cost per 1M input tokens. The orange curve traces the cost-performance frontier obtained by solving for the maximal lottery under an expected-cost budget. The blue curve traces the corresponding frontier for robust lotteries, which instead optimize the worst-case guarantee using a robust linear program. Across budgets, robust lotteries achieve substantially higher worst-case performance than the standard maximal lottery.
  • Figure 2: Maximal lotteries are sensitive to population shifts. Each simplex point is a lottery $p=(p_1,p_2,p_3)$ over models $\{\text{1, 2, 3}\}$ with vertices being deterministic choices. Edge arrows indicate the direction of preference. In each panel, the dashed lines are zero-margin boundaries $f_j(p)=p^\top M e_j=0$, i.e., mixtures that tie pure opponent $j$ under the stratum margin matrix $M$. Left: English stratum has a Condorcet winner, so the maximal lottery collapses to the winner ($p^\star_{\mathrm{EN}}$ at a vertex). Right: Spanish stratum exhibits a 3-cycle, so $p^\star_{\mathrm{ES}}$ lies in the interior, balancing the worst-case opponent. Middle: two nearby population mixtures (weights $\alpha$ and $\alpha'$ over strata) induce two nearby matrices and thus two sets of zero-margin lines; their intersections give $p^\star_{\mathrm{mix}}$ and $p^\star_{\mathrm{mix}}{}'$, illustrating that even with the same majority directions, small shifts in mixture weights can move the maximal lottery. This sensitivity motivates robust maximal lotteries, which optimize a worst-case guarantee over a set of plausible population mixtures.
  • Figure 3: Robust lotteries improve win rate guarantees across subpopulations. We compute robust lotteries for varying radius values $\rho$ and evaluate each lottery on held-out votes (20% split). We show bootstrap means of win rates achieved on the overall population and each subgroup with standard errors (200 samples). As $\rho$ increases, robust lotteries improve the win rate guarantees for the lowest-performing groups with a modest decrease for the highest-performing groups, illustrating a robustness-accuracy trade-off. Left: LMArena, with groups defined by prompt language. Right: HUMAINE, with groups defined by the annotator's ethnic group.
  • Figure 4: Robust lotteries diversify the lottery to handle preference tradeoffs among subpopulations. We present the estimated probability assigned to each model with its standard error from LMArena (200 samples). At $\rho=0$, the lottery concentrates on the top aggregate performer (Gemini 2.5 Pro). As $\rho$ increases, probability mass shifts toward additional strong models (e.g., o3 and Llama 4 Maverick), reflecting the need to hedge to improve win rate guarantees as seen in Figure 3.
  • Figure B.1: Weekly (left) and cumulative (right) mixture distributions of vote categories in LMArena. The weekly plot shows the fraction of votes collected per category each week, while the cumulative plot shows the category composition of all votes collected up to a given week.
  • ...and 16 more figures
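
The behavior described in the Figure 2 caption can be reproduced on a toy example. The sketch below is not the paper's method: the two margin matrices are made up, the names `simplex_grid`, `guarantee`, `worst_case`, `maximal_lottery`, `robust_lottery`, and `mix` are hypothetical, a brute-force grid over the simplex stands in for the linear program, and the ambiguity set is assumed to be all mixtures of the two strata rather than a set governed by a radius $\rho$. Margins are $M_{ij} = \Pr(i \succ j) - \Pr(j \succ i)$.

```python
from fractions import Fraction as F
from itertools import product

STEPS = 60  # grid resolution; an arbitrary choice for this toy example


def simplex_grid(n_models, steps=STEPS):
    """Yield every lottery whose probabilities are multiples of 1/steps."""
    for counts in product(range(steps + 1), repeat=n_models - 1):
        last = steps - sum(counts)
        if last >= 0:
            yield tuple(F(c, steps) for c in counts + (last,))


def guarantee(p, M):
    """Worst margin of lottery p against a pure opponent: min_j p^T M e_j."""
    n = len(M)
    return min(sum(p[i] * M[i][j] for i in range(n)) for j in range(n))


def worst_case(p, strata):
    """Worst guarantee across strata. The margin is linear in the mixture
    weights, so this is also the worst case over all mixtures of the strata."""
    return min(guarantee(p, M) for M in strata)


def maximal_lottery(M, steps=STEPS):
    """Grid point with the best guarantee under a single margin matrix."""
    return max(simplex_grid(len(M), steps), key=lambda p: guarantee(p, M))


def robust_lottery(strata, steps=STEPS):
    """Grid point with the best worst-case guarantee across strata."""
    return max(simplex_grid(len(strata[0]), steps),
               key=lambda p: worst_case(p, strata))


def mix(strata, weights):
    """Margin matrix of a population mixture: sum_k weights[k] * strata[k]."""
    n = len(strata[0])
    return [[sum(w * M[i][j] for w, M in zip(weights, strata))
             for j in range(n)] for i in range(n)]


# Made-up strata (illustrative only, not taken from LMArena).
# "EN": model 0 beats everyone, so it is a Condorcet winner.
M_EN = [[F(0), F(1, 5), F(2, 5)],
        [F(-1, 5), F(0), F(1, 5)],
        [F(-2, 5), F(-1, 5), F(0)]]
# "ES": 3-cycle, 0 beats 1, 1 beats 2, 2 beats 0.
M_ES = [[F(0), F(1, 5), F(-2, 5)],
        [F(-1, 5), F(0), F(2, 5)],
        [F(2, 5), F(-2, 5), F(0)]]
strata = [M_EN, M_ES]

p_en = maximal_lottery(M_EN)   # vertex: (1, 0, 0)
p_es = maximal_lottery(M_ES)   # interior: (2/5, 2/5, 1/5)
# Two nearby mixtures (EN weight 1/4 vs 1/3) give different maximal lotteries.
p_mix_a = maximal_lottery(mix(strata, [F(1, 4), F(3, 4)]))  # (7/15, 4/15, 4/15)
p_mix_b = maximal_lottery(mix(strata, [F(1, 3), F(2, 3)]))  # (1/2, 1/5, 3/10)
# The robust lottery hedges against both strata at once.
p_rob = robust_lottery(strata)
```

Exact `Fraction` arithmetic keeps zero margins and ties exact, which is why the grid search recovers the cyclic solutions without rounding error. Beyond a handful of models the grid becomes impractical and a linear program would take its place.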

Theorems & Definitions (38)

  • Definition 1: Majority Margin Matrix
  • Definition 2: Maximal Lotteries (Fishburn, 1984)
  • Definition 3: Bipartisan Set (Laffond et al., 1993)
  • Definition 4: Ambiguity Set
  • Definition 5: Robust Lotteries
  • Definition 6: Robust Bipartisan Set
  • Definition 7: Robust Condorcet Winner
  • Theorem 8: Robust Condorcet Consistency
  • Theorem 9: Robust Dominance
  • Definition 10: Adding weaker clones
  • ...and 28 more