Re-evaluating Open-ended Evaluation of Large Language Models

Siqi Liu, Ian Gemp, Luke Marris, Georgios Piliouras, Nicolas Heess, Marc Lanctot

TL;DR

This work addresses biases in open-ended LLM evaluation that arise when Elo-based ratings are applied to data containing redundant prompts and responses. It reframes evaluation as a 3-player general-sum game and introduces affinity entropy, an equilibrium-selection criterion that yields clone-invariant Nash equilibrium (NE) and coarse correlated equilibrium (CCE) ratings. The authors provide practical solving methods and demonstrate that equilibrium-based ratings remain intuitive and robust to redundancy, while prompt-model interaction analyses make the ratings interpretable. Empirical results on arena-hard-v0.1 and synthetic simulations show that equilibrium ratings encourage well-rounded skill development and offer actionable insights into the competitive landscape of frontier LLMs. The framework may also apply beyond LLMs to broader AI evaluation tasks.

Abstract

Evaluation has traditionally focused on ranking candidates for a specific skill. Modern generalist models, such as Large Language Models (LLMs), decidedly outpace this paradigm. Open-ended evaluation systems, where candidate models are compared on user-submitted prompts, have emerged as a popular solution. Despite their many advantages, we show that the current Elo-based rating systems can be susceptible to and even reinforce biases in data, intentional or accidental, due to their sensitivity to redundancies. To address this issue, we propose evaluation as a 3-player game, and introduce novel game-theoretic solution concepts to ensure robustness to redundancy. We show that our method leads to intuitive ratings and provide insights into the competitive landscape of LLM development.
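
The sensitivity to redundancy described above can be illustrated with a minimal sketch. It is not the authors' code or data: it assumes Elo-style ratings are obtained from a Bradley-Terry fit, and the model names `A` and `B`, the win counts, and the helper names `bradley_terry` and `strengths_with_clones` are all hypothetical. It shows how repeating a single prompt that favours `B` changes the fitted ordering even though no new information has been added.

```python
from collections import defaultdict


def bradley_terry(matches, iters=200):
    """Fit Bradley-Terry strengths with simple MM updates.

    matches: list of (winner, loser) pairs, one per pairwise comparison.
    Returns a dict of strengths normalised to sum to 1.
    Assumes every model wins at least one comparison.
    """
    models = sorted({m for pair in matches for m in pair})
    wins = defaultdict(float)
    games = defaultdict(float)  # games[(i, j)]: comparisons between i and j
    for winner, loser in matches:
        wins[winner] += 1
        games[(winner, loser)] += 1
        games[(loser, winner)] += 1

    p = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(games[(i, j)] / (p[i] + p[j]) for j in models if j != i)
            new_p[i] = wins[i] / denom
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}
    return p


# Hypothetical data: ten distinct prompts, A wins seven of them.
BASE = [("A", "B")] * 7 + [("B", "A")] * 3
# One hypothetical prompt on which B beats A; clones of it add no new information.
ADVERSARIAL = ("B", "A")


def strengths_with_clones(n_clones):
    """Refit after appending n_clones redundant copies of the adversarial prompt."""
    return bradley_terry(BASE + [ADVERSARIAL] * n_clones)


if __name__ == "__main__":
    for n in (0, 4, 10):
        p = strengths_with_clones(n)
        leader = max(p, key=p.get)
        print(n, {m: round(v, 3) for m, v in p.items()}, "leader:", leader)
```

With only two models the fit reduces to win fractions, so A's strength falls from 7/10 with no clones to 7/20 with ten clones, and B overtakes it. A clone-invariant rating, by contrast, would be unaffected by the repeated prompt.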

Paper Structure

This paper contains 32 sections, 3 theorems, 27 equations, 11 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Affinity entropy $H_a^p$ satisfies all desiderata P1-P4.

Figures (11)

  • Figure 1: (Left) We simulate the effect of the rating method on model development with users submitting highly rated models (and prompts) iteratively. (Center) We show how model and prompt skill entropy evolves under different rating methods over 32 trials. (Right) We show an example sequence of models and prompts maximising their respective ratings. Darker indicates higher value.
  • Figure 2: We inspect the model improvement path induced by NE ratings as shown in Figure 1 (Right). (Left) shows the sequence of additional prompts added at each iteration. Each prompt is the best-of-64 samples according to their NE ratings. (Center) shows the sequence of prompt player NEs. Each row defines a distribution over prompts. (Right) shows the equilibrium-weighted prompt skills and the sequence of king player models. Recall that prompts and models are non-negative vectors over skills; darker indicates higher focus or capability in each skill.
  • Figure 3: We introduce an increasing number of redundant copies of prompts adversarial to gemini-1.5-pro-api-0514 and show model rankings under each method. Models at the same rank are grouped in grey and ordered alphabetically. (Right) We show equilibrium rankings under NE(-a) and CCE(-a) selected using Shannon's entropy instead of the affinity entropy. Dotted lines connecting different rating panels indicate continuity in the labeling. For instance, gemini-1.5-pro-api-0514 consistently ranks first under our NE and CCE ratings, despite the introduction of up to 500 redundant adversarial prompts. Under the Elo ratings, however, its ranking drops significantly once 250 adversarial prompts have been introduced.
  • Figure 4: Highly rated prompts generally have high support under the NE. Redundant prompts (gray bands) receive identical ratings but notably lower support. In sum, equilibrium ratings reflect separability of each prompt with respect to the model equilibrium strategies in isolation, whereas equilibrium support of each prompt further accounts for its redundancy with respect to other prompts. (Top) We show the king-vs-rebel payoffs induced by example prompts. Green indicates king-player winning and red losing. Highly rated prompts tend to discriminate between strong models (top-left corners). (Bottom) We show the NE supports and ratings of all prompts, ordered by their NE ratings.
  • Figure 5: The CCE joint distribution can surface insights in the comparison data. Each bar represents a model family ${\mathcal{F}}$ and its width corresponds to $\sum_{a_r \in {\mathcal{F}}} \delta(a'_k, a_r, {\bm{x}})$ with $a'_k$ a king player model choice and $a_r$ a rebel model belonging to the family ${\mathcal{F}}$. A model's family is determined by its model name prefix. For brevity, we show the king model rating breakdown for the top 5 models.
  • ...and 6 more figures

Theorems & Definitions (17)

  • Definition 1: Affinity Entropy $H_a^p$
  • Theorem 1
  • Remark
  • Remark
  • Remark
  • Remark
  • Remark
  • Remark
  • Remark
  • Remark
  • ...and 7 more