Prediction-Powered Ranking of Large Language Models

Ivi Chatzi, Eleni Straitouri, Suhas Thejaswi, Manuel Gomez Rodriguez

TL;DR

The paper tackles the challenge of ranking large language models by human preferences when human-labeled pairwise comparisons are scarce and model-generated comparisons are abundant but potentially misaligned with human judgment. It introduces a prediction-powered inference framework that constructs a confidence ellipsoid for the human-consistent win probabilities and converts it into rank-sets with asymptotic coverage guarantees. By combining a small human-labeled dataset with a large set of model-generated comparisons, the method yields uncertainty-aware rankings that are more consistent with human preferences than purely model-driven rankings. Empirical results on LMSYS Chatbot Arena data show that rank-sets informed by human data are more likely to cover the true human-consistent ranking than those based solely on comparisons by a strong LLM. The work also provides open-source code.
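As a rough illustration of the prediction-powered idea summarized above, the sketch below estimates a single win probability by averaging the strong LLM's judgments over the large set of model-only comparisons and then correcting that average with the human-versus-LLM disagreement measured on the small human-labeled set. This is the basic prediction-powered mean estimator, not necessarily the exact estimator used in the paper, which works with a joint confidence ellipsoid over all models. The function name `ppi_win_probability` and the 0/1 encoding of outcomes are assumptions made for the example.

```python
import numpy as np


def ppi_win_probability(human_labels, model_on_labeled, model_on_unlabeled):
    """Basic prediction-powered estimate of one win probability (illustrative).

    human_labels:       0/1 human outcomes on the small labeled set
    model_on_labeled:   0/1 strong-LLM outcomes on the same labeled comparisons
    model_on_unlabeled: 0/1 strong-LLM outcomes on the large unlabeled set
    Returns (point_estimate, standard_error).
    """
    human_labels = np.asarray(human_labels, dtype=float)
    model_on_labeled = np.asarray(model_on_labeled, dtype=float)
    model_on_unlabeled = np.asarray(model_on_unlabeled, dtype=float)
    n, big_n = len(human_labels), len(model_on_unlabeled)

    # "Rectifier": how far the LLM's judgments are from the human ones,
    # measured only where both are available.
    rectifier = human_labels - model_on_labeled

    # LLM-only average, shifted by the average human-vs-LLM discrepancy.
    estimate = model_on_unlabeled.mean() + rectifier.mean()

    # The two terms come from independent samples, so their variances add.
    variance = model_on_unlabeled.var(ddof=1) / big_n + rectifier.var(ddof=1) / n
    return estimate, np.sqrt(variance)


# Toy data (made up for illustration only).
est, se = ppi_win_probability(
    human_labels=[1, 0, 1, 1],
    model_on_labeled=[1, 1, 1, 0],
    model_on_unlabeled=[1, 1, 0, 1, 1, 0, 1, 1],
)
print(est, se)
```

In this toy run the LLM and the humans disagree on two labeled comparisons in opposite directions, so the correction term averages to zero and the estimate equals the LLM-only average; the standard error is still inflated by the disagreement.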

Abstract

Large language models are often ranked according to their level of alignment with human preferences -- a model is better than other models if its outputs are more frequently preferred by humans. One of the popular ways to elicit human preferences utilizes pairwise comparisons between the outputs provided by different models to the same inputs. However, since gathering pairwise comparisons by humans is costly and time-consuming, it has become a common practice to gather pairwise comparisons by a strong large language model -- a model strongly aligned with human preferences. Surprisingly, practitioners cannot currently measure the uncertainty that any mismatch between human and model preferences may introduce in the constructed rankings. In this work, we develop a statistical framework to bridge this gap. Given a (small) set of pairwise comparisons by humans and a large set of pairwise comparisons by a model, our framework provides a rank-set -- a set of possible ranking positions -- for each of the models under comparison. Moreover, it guarantees that, asymptotically, with a probability greater than or equal to a user-specified value, the rank-sets cover the true ranking consistent with the distribution of human pairwise preferences. Using pairwise comparisons made by humans in the LMSYS Chatbot Arena platform and pairwise comparisons made by three strong large language models, we empirically demonstrate the effectiveness of our framework and show that the rank-sets constructed using only pairwise comparisons by the strong large language models are often inconsistent with (the distribution of) human pairwise preferences.
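To make the notion of a rank-set concrete, the following minimal sketch derives a range of possible ranking positions for each model from per-model confidence intervals on the win probability. It is a simplification: the paper derives rank-sets from a joint confidence ellipsoid, whereas this example assumes independent intervals. The function name `rank_sets_from_intervals`, the model names, and the interval values are all hypothetical.

```python
def rank_sets_from_intervals(intervals):
    """Map each model to (best_rank, worst_rank), where rank 1 is best.

    intervals: dict of model name -> (lower, upper) bounds on its win probability.
    A model can only be pushed down by models that are certainly better
    (their lower bound exceeds its upper bound) and only be pushed up by
    models that are certainly worse (their upper bound is below its lower bound).
    """
    k = len(intervals)
    rank_sets = {}
    for m, (lo_m, hi_m) in intervals.items():
        surely_better = sum(
            1 for o, (lo_o, _) in intervals.items() if o != m and lo_o > hi_m
        )
        surely_worse = sum(
            1 for o, (_, hi_o) in intervals.items() if o != m and hi_o < lo_m
        )
        rank_sets[m] = (1 + surely_better, k - surely_worse)
    return rank_sets


# Hypothetical intervals for three models.
example = {
    "model_a": (0.60, 0.70),
    "model_b": (0.45, 0.62),
    "model_c": (0.20, 0.30),
}
print(rank_sets_from_intervals(example))
# {'model_a': (1, 2), 'model_b': (1, 2), 'model_c': (3, 3)}
```

Because the intervals of `model_a` and `model_b` overlap, neither can be placed above the other with confidence, so both receive the rank-set {1, 2}; `model_c` is separated from both and is pinned to position 3.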

Paper Structure

This paper contains 17 sections, 1 theorem, 20 equations, 17 figures, 1 table, and 4 algorithms.

Key Result

Theorem 4.1

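The estimates $[\hat{l}(m), \hat{u}(m)]$ of the rank-sets defined by Eq. eq:estimates-rank-sets satisfy the coverage guarantee stated in the abstract: asymptotically, with a probability greater than or equal to a user-specified value, they cover the true ranking consistent with the distribution of human pairwise preferences.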

Figures (17)

  • Figure 1: Average rank-set size against baseline intersection probability for rank-sets constructed using only pairwise comparisons by a strong LLM (LlmGpt4, LlmGpt3.5 and LlmCl3), only pairwise comparisons by humans (Human Only), and pairwise comparisons by both a strong LLM and humans (PprGpt4, PprGpt3.5 and PprCl3) for different values of $\alpha$ and $n=990$. Smaller (larger) average rank-set sizes and larger (smaller) intersection probabilities are better (worse). In all panels, $95\%$ confidence bars for the rank-set size are not shown, as they are always below $0.02$.
  • Figure 2: Average rank-set size against baseline intersection probability for rank-sets constructed using pairwise comparisons by both a strong LLM and humans for different values of $n$ and $\alpha$. Smaller (larger) average rank-set sizes and larger (smaller) intersection probabilities are better (worse). In all panels, $95\%$ confidence bars for the rank-set size are not shown, as they are always below $0.04$.
  • Figure 3: Empirical probability that each ranking position is included in the rank-sets constructed by Baseline, LlmGpt4 and PprGpt4 for each of the LLMs under comparison. In all panels, $n=990$ and $\alpha=0.05$. Larger (smaller) dots indicate higher (lower) empirical probability.
  • Figure 4: Empirical probability of each rank-set constructed by Baseline, LlmGpt4 and PprGpt4 for GPT 4 (left), Claude 1 (middle left), Vicuna (middle right) and PaLM 2 (right). In all panels, $n=990$ and $\alpha=0.05$.
  • Figure 5: The number of pairwise comparisons for each pair of models after all preprocessing steps.
  • ...and 12 more figures
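The two quantities plotted in Figures 1 and 2 can be sketched as follows. The captions do not define them, so the definitions used here are assumptions: rank-set size is taken as the number of positions in a model's rank-set, averaged over models, and baseline intersection probability is taken as the fraction of repeated runs in which every model's rank-set shares at least one position with its Baseline rank-set. All function names and the example rank-sets are hypothetical.

```python
def average_rank_set_size(rank_sets):
    """Mean number of positions per rank-set; rank_sets maps model -> (lo, hi)."""
    return sum(hi - lo + 1 for lo, hi in rank_sets.values()) / len(rank_sets)


def intersects_baseline(rank_sets, baseline_rank_sets):
    """True if, for every model, its rank-set overlaps the baseline rank-set."""
    return all(
        max(lo, baseline_rank_sets[m][0]) <= min(hi, baseline_rank_sets[m][1])
        for m, (lo, hi) in rank_sets.items()
    )


def intersection_probability(runs, baseline_runs):
    """Fraction of paired runs in which all rank-sets overlap the baseline's."""
    hits = [intersects_baseline(r, b) for r, b in zip(runs, baseline_runs)]
    return sum(hits) / len(hits)


# Two hypothetical repetitions of an experiment with three models.
baseline = {"a": (1, 1), "b": (2, 2), "c": (3, 3)}
run_1 = {"a": (1, 2), "b": (1, 2), "c": (3, 3)}  # overlaps baseline for all models
run_2 = {"a": (2, 3), "b": (1, 1), "c": (2, 3)}  # misses baseline for "a" and "b"

print(average_rank_set_size(run_1))
print(intersection_probability([run_1, run_2], [baseline, baseline]))
```

Under these assumed definitions the two metrics pull in opposite directions: wider rank-sets overlap the baseline more easily but are less informative, which is why the captions describe small sizes together with large intersection probabilities as the desirable combination.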

Theorems & Definitions (1)

  • Theorem 4.1