ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition

Haidar Khan, Hisham A. Alyahya, Yazeed Alnumay, M Saiful Bari, Bülent Yener

TL;DR

ZeroSumEval introduces a scalable, competition-based framework that evaluates LLMs through zero-sum games, creating dynamic benchmarks that resist saturation and data contamination. It combines a diverse game suite (Chess, Poker, Liar's Dice, MathQuiz, PyJail, Gandalf, and Debate) with automated verification, DSPy-based prompt abstraction, and a Bradley-Terry rating system to compare models. Key findings: frontier models perform well on common games but struggle to generate novel and challenging tests, and models cannot reliably jailbreak one another, which points to gaps in creativity and test generation. The framework is extensible and released as open source, supporting cross-model comparisons that can inform model development and deployment decisions.
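To make the Bradley-Terry rating mentioned above concrete, here is a minimal sketch of fitting Bradley-Terry strengths from pairwise match results. The source names only the rating model; the function names (`fit_bradley_terry`, `win_probability`), the iterative fitting procedure, the normalization, and the input format are assumptions made for illustration and are not taken from the ZeroSumEval codebase.

```python
from collections import defaultdict


def fit_bradley_terry(matches, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Illustrative only: uses the classic fixed-point (minorize-maximize) update.
    Draws are not modeled, and a player with zero wins is driven to strength 0.
    """
    wins = defaultdict(float)
    pair_games = defaultdict(float)
    for winner, loser in matches:
        wins[winner] += 1.0
        wins[loser] += 0.0  # register the loser even if it never wins
        pair_games[frozenset((winner, loser))] += 1.0

    players = sorted(wins)
    if not players:
        return {}

    strength = {p: 1.0 for p in players}
    for _ in range(iters):
        updated = {}
        for i in players:
            # Sum over opponents j of n_ij / (s_i + s_j), n_ij = games played.
            denom = sum(
                pair_games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in players
                if j != i and frozenset((i, j)) in pair_games
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Strengths are only defined up to scale, so fix the mean at 1.
        scale = len(players) / sum(updated.values())
        updated = {p: s * scale for p, s in updated.items()}
        delta = max(abs(updated[p] - strength[p]) for p in players)
        strength = updated
        if delta < tol:
            break
    return strength


def win_probability(strength, a, b):
    """Model probability that player a beats player b."""
    return strength[a] / (strength[a] + strength[b])
```

With three wins for `a` and one for `b`, the fitted model assigns `a` a 0.75 chance of beating `b`, matching the empirical win rate. In a real evaluation the strengths would typically be reported on a log scale with confidence intervals; that step is omitted here.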

Abstract

Evaluating the capabilities of Large Language Models (LLMs) has traditionally relied on static benchmark datasets, human assessments, or model-based evaluations, all of which often suffer from overfitting, high costs, and biases. ZeroSumEval is a novel competition-based evaluation protocol that leverages zero-sum games to assess LLMs with dynamic benchmarks that resist saturation. ZeroSumEval encompasses a diverse suite of games, including security challenges (PyJail), classic games (Chess, Liar's Dice, Poker), knowledge tests (MathQuiz), and persuasion challenges (Gandalf, Debate). These games are designed to evaluate a range of AI capabilities such as strategic reasoning, planning, knowledge application, and creativity. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework. To demonstrate this, we conduct extensive experiments with >7000 simulations across 7 games and 13 models. Our results show that while frontier models from the GPT and Claude families can play common games and answer questions, they struggle to play games that require creating novel and challenging questions. We also observe that models cannot reliably jailbreak each other and generally fail at tasks requiring creativity. We release our code at https://github.com/facebookresearch/ZeroSumEval.

Paper Structure

This paper contains 26 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Cumulative ratings of 13 models on ZeroSumEval. The top performing models (gpt-4o and claude-3.7-sonnet) show mostly on-par performance across ZeroSumEval games. Thinking model quality varies between model families (e.g. claude-3.7-sonnet-thinking vs deepseek-r1). Surprisingly, o3-mini-high performs worst amongst this cohort of models.
  • Figure 2: Example of a chess game trace. Here deepseek-chat (white) executes a knight fork against llama3.3-70b (black) after which black loses the game by failing to produce a legal move.
  • Figure 3: Example of a MathQuiz game trace. At this point in the game, claude-3.7-sonnet (teacher) has generated a question and proven the question is valid by solving it. Now it is llama3.1-405b's (student) turn to answer the question (which it fails to do after multiple attempts).
  • Figure 4: State diagram of the verification process involving the ZSEval Manager and the LLM Generator. Blue boxes indicate deterministic steps and green boxes indicate steps involving the LLM.
  • Figure 5: Summarized outcomes from four games: (A) Chess, (B) Gandalf, (C) MathQuiz, and (D) Poker. Most models can easily play valid Chess and Poker. Models struggle with the creative aspects of Gandalf and with creating valid but challenging MathQuiz questions.
  • ...and 2 more figures
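
The generate-then-verify flow described in the Figure 3 and Figure 4 captions (the teacher proposes a question, proves it valid by solving it, and the student then gets several attempts) can be sketched roughly as follows. This is a hypothetical illustration: the function name `play_mathquiz_round`, the callable interfaces, the attempt limit, and the rule that an unverifiable question counts as a loss for the teacher are all assumptions, not details confirmed by the source.

```python
from typing import Callable, Tuple


def play_mathquiz_round(
    teacher_generate: Callable[[], Tuple[str, str]],  # -> (question, target answer)
    teacher_solve: Callable[[str], str],              # teacher re-solves its own question
    student_solve: Callable[[str], str],              # student attempts the question
    max_attempts: int = 3,                            # assumed retry budget
) -> str:
    """Return "teacher" or "student" as the winner of one round (assumed rules)."""
    question, target = teacher_generate()

    # Deterministic check by the manager: the teacher must reproduce its own
    # target answer, otherwise the question is treated as invalid.
    if teacher_solve(question).strip() != target.strip():
        return "student"

    # The student gets a bounded number of attempts at the verified question.
    for _ in range(max_attempts):
        if student_solve(question).strip() == target.strip():
            return "student"
    return "teacher"
```

In practice each callable would wrap an LLM call, while the comparisons stay deterministic, which mirrors the split between LLM steps and deterministic steps in the Figure 4 caption.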