Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat

Roland Daynauth, Christopher Clarke, Krisztian Flautner, Lingjia Tang, Jason Mars

TL;DR

The paper addresses the problem of reliably ranking LLMs using head-to-head pairwise evaluations, showing that common methods like Elo can yield unstable and sometimes unreliable rankings. It formalizes four ranking approaches—Elo, Bradley-Terry, Glicko, and Markov Chain—and assesses them on two diverse datasets (Arena and SLAM) across transitivity, prediction accuracy, and hyperparameter sensitivity. Key findings include Bradley-Terry’s strong transitivity preservation, Elo’s instability, and Glicko’s robustness through rating deviation, with practical guidelines tailored to dataset size and distribution. The work provides actionable recommendations for selecting ranking methods in real-world LLM evaluation contexts and contributes reproducible data and code for ongoing research.

Abstract

Deciding which large language model (LLM) to use is a complex challenge. Pairwise ranking has emerged as a new method for evaluating human preferences for LLMs. This approach entails humans evaluating pairs of model outputs based on a predefined criterion. By collecting these comparisons, a ranking can be constructed using methods such as Elo. However, applying these algorithms as constructed in the context of LLM evaluation introduces several challenges. In this paper, we explore the effectiveness of ranking systems for head-to-head comparisons of LLMs. We formally define a set of fundamental principles for effective ranking and conduct a series of extensive evaluations on the robustness of several ranking algorithms in the context of LLMs. Our analysis uncovers key insights into the factors that affect ranking accuracy and efficiency, offering guidelines for selecting the most appropriate methods based on specific evaluation contexts and resource constraints.
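As a rough illustration of how a ranking can be built from collected pairwise comparisons, the sketch below applies the textbook Elo update (base 10, scale 400) to a list of head-to-head outcomes. The function names, the model names, the initial rating of 1000, and the default `k = 32` are assumptions chosen for illustration; they are not taken from the paper and may differ from the authors' setup.

```python
from typing import Dict, Iterable, Tuple

# One comparison: (model_a, model_b, score_a), where score_a is
# 1.0 if model_a was preferred, 0.0 if model_b was, 0.5 for a tie.
Match = Tuple[str, str, float]


def expected_score(r_a: float, r_b: float) -> float:
    """Textbook Elo expected score of A against B (base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(
    matches: Iterable[Match], k: float = 32.0, initial: float = 1000.0
) -> Dict[str, float]:
    """Process matches sequentially and return the final rating per model.

    k and initial are illustrative defaults, not values from the paper.
    """
    ratings: Dict[str, float] = {}
    for a, b, score_a in matches:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        e_a = expected_score(r_a, r_b)
        # Zero-sum update: whatever A gains, B loses.
        delta = k * (score_a - e_a)
        ratings[a] = r_a + delta
        ratings[b] = r_b - delta
    return ratings


def ranking(ratings: Dict[str, float]) -> list:
    """Model names sorted from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical comparisons that form a preference cycle.
    matches = [
        ("model_a", "model_b", 1.0),
        ("model_b", "model_c", 1.0),
        ("model_c", "model_a", 1.0),
    ]
    print(ranking(elo_ratings(matches)))
    print(ranking(elo_ratings(reversed(matches))))
```

Because the update is sequential, the same set of comparisons processed in a different order can yield different ratings, which is one reason the choice of `k` and the match ordering matter when Elo is applied to LLM evaluation data.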

Paper Structure

This paper contains 44 sections, 15 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Different ranking algorithms can produce different rankings with the same human evaluation data, making it difficult to determine which algorithm is appropriate for various use cases.
  • Figure 2: Distribution of F1 scores for the SLAM (left) and ARENA (right) datasets, showing the performance of Elo, Markov, and Glicko algorithms across a subset of models with 100 different hyperparameter settings. The results highlight the volatility of Elo, demonstrating its high sensitivity to hyperparameter changes compared to the more stable performance of Markov and Glicko.
  • Figure 3: F1 scores for all models in the SLAM dataset.
  • Figure 4: F1 scores of the top ten (10) highest and lowest ranked models using the Elo rating system on the Chatbot Arena dataset. In general, Elo provides the best prediction accuracy by achieving an F1 score of $0.90$.
  • Figure 5: Elo produces different ranks depending on the value of the hyperparameter $k$. Increasing the number of permutations can lead to more stable ratings; however, model ranks may still be unstable, as is the case with orca-min_3b and neural-chat_7b.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Definition 3.1
  • Definition A.1