Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat
Roland Daynauth, Christopher Clarke, Krisztian Flautner, Lingjia Tang, Jason Mars
TL;DR
The paper addresses the problem of reliably ranking LLMs from head-to-head pairwise evaluations, showing that common methods such as Elo can yield unstable, sometimes unreliable rankings. It formalizes four ranking approaches (Elo, Bradley-Terry, Glicko, and Markov Chain) and assesses them on two diverse datasets (Arena and SLAM) along three axes: transitivity, prediction accuracy, and hyperparameter sensitivity. Key findings include Bradley-Terry’s strong transitivity preservation, Elo’s instability, and Glicko’s robustness through its rating deviation, together with practical guidelines tailored to dataset size and distribution. The work offers actionable recommendations for selecting ranking methods in real-world LLM evaluation and contributes reproducible data and code for ongoing research.
Abstract
Deciding which large language model (LLM) to use is a complex challenge. Pairwise ranking has emerged as a new method for evaluating human preferences for LLMs. In this approach, humans compare pairs of model outputs against a predefined criterion. From the collected comparisons, a ranking can be constructed using methods such as Elo. However, applying these algorithms as originally constructed to LLM evaluation introduces several challenges. In this paper, we explore the effectiveness of ranking systems for head-to-head comparisons of LLMs. We formally define a set of fundamental principles for effective ranking and conduct a series of extensive evaluations of the robustness of several ranking algorithms in the context of LLMs. Our analysis uncovers key insights into the factors that affect ranking accuracy and efficiency, offering guidelines for selecting the most appropriate methods based on specific evaluation contexts and resource constraints.
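To make the idea of building a ranking from pairwise comparisons concrete, the following is a minimal sketch of the standard Elo update applied to a list of (winner, loser) outcomes. It is illustrative only and is not taken from the paper's code: the function name `elo_rank`, the K-factor of 32, the initial rating of 1000, and the model names are all assumptions chosen for the example.

```python
from collections import defaultdict


def elo_rank(comparisons, k=32.0, initial=1000.0):
    """Rank models from (winner, loser) pairs with the standard Elo update.

    k and initial are assumed illustrative defaults, not values from the paper.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in comparisons:
        r_w, r_l = ratings[winner], ratings[loser]
        # Expected score of the winner under the logistic Elo model.
        expected_win = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))
        # The winner gains what the loser gives up, so total rating is conserved.
        delta = k * (1.0 - expected_win)
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical comparison outcomes, processed in order.
comparisons = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
ranking = elo_rank(comparisons)
for name, rating in ranking:
    print(f"{name}: {rating:.1f}")
```

Because each update depends on the ratings at the time a comparison is processed, the final ratings generally depend on the order of the comparisons and on the choice of K, which is one source of the instability discussed above.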
