Prompt-Dependent Ranking of Large Language Models with Uncertainty Quantification

Angel Rodrigo Avelar Menendez, Yufeng Liu, Xiaowu Dai

TL;DR

This work studies prompt-dependent ranking inference under pairwise human preferences and develops a framework for decision-safe rankings with statistically valid uncertainty guarantees. It shows that uncertainty-aware rankings declare dominance only when the data support it and otherwise return partial orders.

Abstract

Rankings derived from pairwise comparisons are central to many economic and computational systems. In the context of large language models (LLMs), rankings are typically constructed from human preference data and presented as leaderboards that guide deployment decisions. However, existing approaches rely on point estimates, implicitly treating rankings as fixed objects despite substantial estimation noise and context-dependent performance variation. Acting on such rankings can lead to misallocation and welfare loss when apparent differences are not statistically meaningful. We study prompt-dependent ranking inference under pairwise human preferences and develop a framework for decision-safe rankings with statistically valid uncertainty guarantees. We model preferences using a contextual Bradley-Terry-Luce model in which the latent utility of each model depends on the input prompt. Rather than targeting point estimates of utilities, we directly conduct inference on induced rankings, constructing confidence sets based on simultaneous confidence intervals for pairwise utility differences. This approach yields statistically valid marginal and simultaneous confidence sets for prompt-specific ranks. Our framework connects recent advances in rank inference to contextual preference learning and provides tools for robust ranking-based decision-making. Empirically, using large-scale human preference data from LLM evaluations, we show that rankings vary substantially across prompt characteristics and that many apparent rank differences are not statistically distinguishable. We further demonstrate how uncertainty-aware rankings identify dominance only when supported by the data and otherwise return partial orders.
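
As a rough illustration of the step from simultaneous confidence intervals for pairwise utility differences to confidence sets for ranks, the sketch below counts, for each model, how many competitors are confidently better and how many are confidently worse at a fixed prompt. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `rank_confidence_sets`, the toy utilities, and the common half-width of 0.5 are all hypothetical, and in practice the interval bounds would come from the fitted contextual Bradley-Terry-Luce model and a simultaneous critical value.

```python
import numpy as np

def rank_confidence_sets(lower, upper):
    """Turn simultaneous CIs for pairwise utility differences into rank intervals.

    lower[i, j] and upper[i, j] bound theta_i(x) - theta_j(x) at a fixed prompt x.
    Rank 1 is best. Diagonal entries are ignored.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n = lower.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    # Model j confidently beats i when the whole interval for theta_i - theta_j is below zero.
    beaten_by = (upper < 0) & off_diag
    # Model i confidently beats j when the whole interval lies above zero.
    beats = (lower > 0) & off_diag
    best_rank = 1 + beaten_by.sum(axis=1)
    worst_rank = n - beats.sum(axis=1)
    return [(int(a), int(b)) for a, b in zip(best_rank, worst_rank)]

# Toy inputs (hypothetical): three models with estimated utilities at one prompt
# and the same assumed half-width for every pairwise difference.
theta_hat = np.array([2.0, 0.5, 0.3])
diff = np.subtract.outer(theta_hat, theta_hat)  # diff[i, j] = theta_i - theta_j
half_width = 0.5
sets = rank_confidence_sets(diff - half_width, diff + half_width)
print(sets)  # [(1, 1), (2, 3), (2, 3)]
```

In this toy case the first model has a singleton rank set, while the other two cannot be separated from each other, so the output is a partial order rather than a full ranking.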

Paper Structure

This paper contains 45 sections, 16 theorems, 90 equations, 4 figures, and 4 tables.

Key Result

Theorem 2

Suppose Assumptions as:comparison_graph_connectivity--as:limit_fisher hold. Then as $L\to\infty$, where $\mathcal{P}$ is defined in eq:projection_matrix as the projection onto the constrained parameter space, and $\bar{\mathcal{I}}$ is the limiting Fisher information defined in Assumption as:limit_fisher.

Figures (4)

  • Figure 1: Illustrative example: prompt-dependent rankings and ranking uncertainty. (a) Estimated rankings of five LLMs as a function of prompt length, measured by token count. Rankings are obtained by ordering the fitted prompt-dependent utilities $\hat{\theta}_i(x)$, with rank changes occurring at intersections of the estimated utility functions. (b) Prompt-dependent 95% simultaneous confidence sets for the ranks of Vicuna-7b and GPT-4-1106 as a function of prompt length. For short prompts, the rank confidence sets are narrow, indicating statistically supported dominance relationships. As prompt length increases, the confidence sets widen, reflecting growing uncertainty and partial identification of the ranking.
  • Figure 2: Prompt-dependent rankings and ranking uncertainty for the Specificity category. Predicted rankings and 95% marginal confidence intervals for ten LLMs under intrinsic preferences and under prompts exclusively labeled with the Specificity category. Intrinsic estimates, corresponding to a zero covariate vector, exhibit wide confidence intervals, indicating substantial uncertainty in aggregate preferences. Introducing the Specificity category alters both predicted ranks and their uncertainty. While many apparent rank differences remain statistically insignificant, Grok-4 exhibits statistically supported dominance with a singleton confidence interval. The figure illustrates how prompt characteristics affect both rankings and their reliability, and why uncertainty-aware rankings are essential for decision making.
  • Figure 3: Rankings and uncertainty under multi-category prompts. Predicted rankings and 95% marginal confidence intervals for ten LLMs under two composite prompt types. Bold estimates correspond to prompts associated with Code, Complexity, Domain Knowledge, Problem Solving, Real World, and Technical Accuracy, representing complex coding tasks. Lighter estimates correspond to prompts associated with Creativity and Creative Writing. The figure highlights task-dependent specialization and shows that ranking uncertainty differs substantially across prompt mixtures.
  • Figure 4: Asymptotic 95% simultaneous confidence intervals for covariate-only utility differences under prompt-length extrapolation. The left panel shows lower bounds and the right panel shows upper bounds. All intervals contain zero, indicating that no pairwise ordering is statistically resolved in the extreme prompt-length regime.
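
The Figure 1 caption describes rankings obtained by ordering the fitted utilities $\hat{\theta}_i(x)$, with rank changes at intersections of the utility curves. The sketch below illustrates that mechanism for a single scalar prompt feature. It assumes, purely for illustration, that each utility is linear in the feature, $\hat{\theta}_i(x) = a_i + b_i x$; the function names `ranks_at` and `crossing_point` and the coefficients are hypothetical and not taken from the paper.

```python
import numpy as np

def ranks_at(intercepts, slopes, x):
    """Rank models (1 = best) by the assumed linear utilities a_i + b_i * x."""
    utilities = np.asarray(intercepts, dtype=float) + np.asarray(slopes, dtype=float) * x
    order = np.argsort(-utilities, kind="stable")  # indices from highest to lowest utility
    ranks = np.empty(len(utilities), dtype=int)
    ranks[order] = np.arange(1, len(utilities) + 1)
    return ranks.tolist()

def crossing_point(a_i, b_i, a_j, b_j):
    """Value of x where two linear utilities intersect, or None if they are parallel."""
    if b_i == b_j:
        return None
    return (a_j - a_i) / (b_i - b_j)

# Toy coefficients (hypothetical): model 0 starts ahead but declines with x,
# model 1 starts behind but improves with x.
intercepts = [1.0, 0.0]
slopes = [-0.5, 0.5]

print(ranks_at(intercepts, slopes, 0.0))      # [1, 2]
print(ranks_at(intercepts, slopes, 2.0))      # [2, 1]
print(crossing_point(1.0, -0.5, 0.0, 0.5))    # 1.0
```

The point-estimate ranking flips exactly at the crossing. Whether the flip is statistically supported is a separate question, which the confidence sets in the figures address.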

Theorems & Definitions (17)

  • Definition 1
  • Theorem 2
  • Corollary 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Proposition 7
  • Lemma 8
  • Lemma 9
  • Lemma 10
  • ...and 7 more