Re-evaluating Open-ended Evaluation of Large Language Models

Siqi Liu; Ian Gemp; Luke Marris; Georgios Piliouras; Nicolas Heess; Marc Lanctot

TL;DR

This work addresses biases in open-ended LLM evaluation that arise when Elo-based ratings are applied to data containing redundant prompts and responses. It reframes evaluation as a 3-player general-sum game and introduces clone-invariant equilibrium concepts, notably one based on affinity entropy, to produce Nash equilibrium (NE) and coarse correlated equilibrium (CCE) ratings that are unaffected by redundancy. The authors provide practical solving methods and show that equilibrium-based ratings remain intuitive and robust to redundancy, while analyses of prompt-model interactions make the ratings interpretable. Empirical results on arena-hard-v0.1 and synthetic simulations show that equilibrium ratings encourage well-rounded skill development and offer actionable insights into the competitive landscape of frontier LLMs. The framework may also apply beyond LLMs to broader AI evaluation tasks.

Abstract

Evaluation has traditionally focused on ranking candidates for a specific skill. Modern generalist models, such as Large Language Models (LLMs), decidedly outpace this paradigm. Open-ended evaluation systems, where candidate models are compared on user-submitted prompts, have emerged as a popular solution. Despite their many advantages, we show that the current Elo-based rating systems can be susceptible to and even reinforce biases in data, intentional or accidental, due to their sensitivity to redundancies. To address this issue, we propose evaluation as a 3-player game, and introduce novel game-theoretic solution concepts to ensure robustness to redundancy. We show that our method leads to intuitive ratings and provide insights into the competitive landscape of LLM development.
