Re-evaluating Open-ended Evaluation of Large Language Models
Siqi Liu, Ian Gemp, Luke Marris, Georgios Piliouras, Nicolas Heess, Marc Lanctot
TL;DR
This work addresses biases in open-ended LLM evaluation that arise when Elo-based ratings are computed over redundant prompts and responses. It reframes evaluation as a 3-player general-sum game and introduces invariant equilibrium concepts, notably affinity entropy, to produce clone-invariant Nash equilibrium (NE) and coarse correlated equilibrium (CCE) ratings. The authors provide practical solving methods and demonstrate that equilibrium-based ratings remain intuitive and robust to redundancy, while analyses of prompt-model interactions make the ratings interpretable. Empirical results on arena-hard-v0.1 and synthetic simulations show that equilibrium ratings encourage well-rounded skill development and offer actionable insights into the competitive landscape of frontier LLMs. Overall, the proposed framework offers robust, interpretable metrics for open-ended evaluation, with potential applicability beyond LLMs to broader AI evaluation tasks.
Abstract
Evaluation has traditionally focused on ranking candidates for a specific skill. Modern generalist models, such as Large Language Models (LLMs), decidedly outpace this paradigm. Open-ended evaluation systems, where candidate models are compared on user-submitted prompts, have emerged as a popular solution. Despite their many advantages, we show that the current Elo-based rating systems can be susceptible to and even reinforce biases in data, intentional or accidental, due to their sensitivity to redundancies. To address this issue, we propose evaluation as a 3-player game, and introduce novel game-theoretic solution concepts to ensure robustness to redundancy. We show that our method leads to intuitive ratings and provide insights into the competitive landscape of LLM development.
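The sensitivity to redundancy mentioned above can be illustrated with a toy calculation. The sketch below is not the authors' method or data: it assumes two hypothetical models, "A" and "B", two hypothetical prompts, and uses the closed-form two-player Bradley-Terry fit on the Elo scale as a stand-in for an Elo-based rating system. Cloning one prompt adds no new information about either model, yet it changes the rating gap.

```python
import math


def elo_gap(outcomes):
    """Elo-scale rating gap (A minus B) from a two-player Bradley-Terry fit.

    `outcomes` is a list of (prompt, winner) pairs with winner in {"A", "B"}.
    Assumes each model wins at least once, so the gap is finite.
    """
    wins_a = sum(1 for _, winner in outcomes if winner == "A")
    wins_b = sum(1 for _, winner in outcomes if winner == "B")
    # Maximum-likelihood gap d solves 1 / (1 + 10 ** (-d / 400)) = wins_a / total.
    return 400.0 * math.log10(wins_a / wins_b)


# Hypothetical data: one prompt per skill, and each model wins one of them.
base = [("math", "A"), ("coding", "B")]

# Add nine verbatim clones of the math prompt (ten copies in total).
cloned = base + [("math", "A")] * 9

gap_base = elo_gap(base)      # models are tied on the original prompt set
gap_cloned = elo_gap(cloned)  # model A now appears far stronger

print(f"gap without clones: {gap_base:.1f}")
print(f"gap with clones:    {gap_cloned:.1f}")
```

In this toy setting the two models are tied on the original prompts, while the cloned set puts model A roughly 400 Elo points ahead, purely because one prompt was repeated. A clone-invariant rating would, by definition, assign the same ratings in both cases.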
