ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition

Hisham A. Alyahya, Haidar Khan, Yazeed Alnumay, M Saiful Bari, Bülent Yener

TL;DR

ZeroSumEval introduces a dynamic, competition-based framework for evaluating large language models by pitting models against each other in a library of evolving, game-based tasks. It formalizes a modular architecture that separates game logic from player strategy and integrates DSPy-based strategies to enable high-level, prompt-efficient play. The framework supports automated verification for complex knowledge and security challenges and ranks models from head-to-head outcomes using Bradley-Terry ratings with bootstrapped confidence intervals. By delivering scalable, interpretable, and extensible benchmarks, ZeroSumEval addresses static-benchmark pitfalls such as data contamination and prompt sensitivity and paves the way for multimodal and adversarial extensions.
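
To make the ranking step concrete, the following is a minimal sketch of Bradley-Terry ratings with bootstrapped confidence intervals. It is not taken from the ZeroSumEval code base. The function names (`fit_bt`, `bootstrap_ci`), the iterative fitting update, the small `prior` smoothing term, and the match data are all assumptions chosen for illustration. The model itself is standard: player i beats player j with probability p_i / (p_i + p_j).

```python
import math
import random
from collections import Counter
from itertools import permutations


def fit_bt(matches, players, prior=0.1, iters=200):
    """Fit Bradley-Terry log-strengths from a list of (winner, loser) pairs.

    Uses the classic iterative update p_i = W_i / sum_j n_ij / (p_i + p_j).
    `prior` adds a small virtual win for every ordered pair (an assumption
    made here so that no strength collapses to zero in a resample).
    """
    wins = Counter(matches)
    for pair in permutations(players, 2):
        wins[pair] += prior
    p = {x: 1.0 for x in players}
    for _ in range(iters):
        new = {}
        for i in players:
            others = [j for j in players if j != i]
            w_i = sum(wins[(i, j)] for j in others)
            denom = sum((wins[(i, j)] + wins[(j, i)]) / (p[i] + p[j]) for j in others)
            new[i] = w_i / denom
        # Rescale so the geometric mean is 1 (log-strengths sum to zero).
        g = math.exp(sum(math.log(v) for v in new.values()) / len(new))
        p = {k: v / g for k, v in new.items()}
    return {k: math.log(v) for k, v in p.items()}


def bootstrap_ci(matches, players, n_boot=200, alpha=0.05, seed=0):
    """Percentile confidence intervals from resampling matches with replacement."""
    rng = random.Random(seed)
    samples = {x: [] for x in players}
    for _ in range(n_boot):
        resample = rng.choices(matches, k=len(matches))
        for k, v in fit_bt(resample, players).items():
            samples[k].append(v)
    ci = {}
    for k, vals in samples.items():
        vals.sort()
        lo = vals[int((alpha / 2) * (n_boot - 1))]
        hi = vals[int((1 - alpha / 2) * (n_boot - 1))]
        ci[k] = (lo, hi)
    return ci


# Hypothetical head-to-head results, invented for this example.
players = ["A", "B", "C"]
matches = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("B", "C")] * 7 + [("C", "B")] * 3
    + [("A", "C")] * 9 + [("C", "A")] * 1
)
ratings = fit_bt(matches, players)
intervals = bootstrap_ci(matches, players)
```

Here `ratings` holds one log-strength per model and `intervals` holds a (lower, upper) pair per model at the 95% level; overlapping intervals suggest that the match record does not clearly separate two models.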

Abstract

We introduce ZeroSumEval, a dynamic, competition-based, and evolving evaluation framework for Large Language Models (LLMs) that leverages competitive games. ZeroSumEval encompasses a diverse suite of games, including security challenges (Capture the Flag), classic board games (chess), and knowledge tests (MathQuiz). These games are designed to evaluate a range of capabilities such as strategic reasoning, planning, knowledge application, safety, and adaptability. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework for easily implementing games and by leveraging DSPy to provide a better abstraction for LLM player strategies.

Paper Structure

This paper contains 17 sections, 1 equation, 4 figures.

Figures (4)

  • Figure 1: A high-level example implementation of the GameState class of Chess in ZeroSumEval.
  • Figure 2: An example flow of the Game Manager for the game of Chess. The state of the game moves forward by (i) querying the current state for the next action and the player that is expected to act, (ii) executing that action using the player's implementation for that action, (iii) updating the game state with the result, and (iv) repeating steps i-iii until the game terminates. The scores are then calculated from the final state and a winner is determined accordingly.
  • Figure 3: State diagram of the verification process involving the Game Manager and the Generator. Purple boxes indicate deterministic steps and blue boxes indicate steps involving the model.
  • Figure 4: Ratings of Llama 3 models of various subversions and sizes placed head-to-head. Error bars are 95% confidence intervals of BT ratings obtained via bootstrapping.
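
The Game Manager flow described in the Figure 2 caption can be sketched as a short loop. This is an illustrative sketch only, not the ZeroSumEval API: the class `NimState`, the function `run_game`, the method names, and the two toy strategies are hypothetical, and a small stone-taking game stands in for Chess so that the example needs no external libraries. Plain Python functions stand in for the model-backed player implementations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

ROLES = ("first", "second")


def other(role: str) -> str:
    return ROLES[1] if role == ROLES[0] else ROLES[0]


@dataclass(frozen=True)
class NimState:
    """Toy game state: players alternately take 1-3 stones; taking the last stone wins."""
    stones: int = 7
    to_move: str = "first"
    winner: Optional[str] = None

    def next_action(self) -> Tuple[str, str]:
        # (i) which action is expected next, and which role must perform it
        return "take", self.to_move

    def update(self, move: int) -> "NimState":
        # (iii) apply the move; an illegal move forfeits the game (an assumption here)
        if not 1 <= move <= min(3, self.stones):
            return NimState(self.stones, self.to_move, winner=other(self.to_move))
        remaining = self.stones - move
        if remaining == 0:
            return NimState(0, self.to_move, winner=self.to_move)
        return NimState(remaining, other(self.to_move))

    def is_over(self) -> bool:
        return self.winner is not None

    def scores(self) -> Dict[str, float]:
        return {r: 1.0 if r == self.winner else 0.0 for r in ROLES}


Player = Dict[str, Callable[[NimState], int]]


def run_game(state: NimState, players: Dict[str, Player]) -> Dict[str, float]:
    """Drive the game to completion and return the final scores."""
    while not state.is_over():                 # (iv) repeat until terminated
        action, role = state.next_action()     # (i) query the state
        move = players[role][action](state)    # (ii) run the player's implementation
        state = state.update(move)             # (iii) advance the state
    return state.scores()


def optimal_take(state: NimState) -> int:
    # Leave a multiple of 4 when possible, otherwise take one stone.
    return state.stones % 4 or 1


def always_one(state: NimState) -> int:
    return 1


result = run_game(
    NimState(stones=7),
    {"first": {"take": optimal_take}, "second": {"take": always_one}},
)
```

Keeping the rules in the state object and the decisions in per-action player functions mirrors the separation of game logic from strategy described in the TL;DR: a new game supplies a new state class, and the loop itself does not change.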