The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

Simon Henniger, Gabriel Poesia

TL;DR

The Token Games (TTG) is an evaluation framework in which models challenge each other by creating their own puzzles. It suggests new paradigms for evaluating reasoning that cannot be saturated by design and that test other skills, such as creativity and task creation, alongside problem solving.

Abstract

Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks using PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or if similar problems have been seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles - given a Python function that returns a boolean, find inputs that make it return True - to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each other. We evaluate 10 frontier models on TTG, and closely match the ranking from existing benchmarks such as Humanity's Last Exam, without involving any human effort in creating puzzles. We also find that creating good puzzles is still a highly challenging task for current models, not measured by previous benchmarks. Overall, our work suggests new paradigms for evaluating reasoning that cannot be saturated by design, and that allow testing models for other skills like creativity and task creation alongside problem solving.
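To make the Programming Puzzles format concrete, here is a minimal sketch in Python. The puzzle itself and the `verify` helper are hypothetical illustrations, not taken from the paper; the only thing assumed from the abstract is that a puzzle is a Python function returning a boolean and that a solution is an input that makes it return True.

```python
from typing import Any, Callable


def puzzle(x: int) -> bool:
    """Hypothetical puzzle: find a positive integer whose square ends in 444."""
    return x > 0 and (x * x) % 1000 == 444


def verify(puzzle_fn: Callable[[Any], bool], candidate: Any) -> bool:
    """Check a candidate answer by running the puzzle on it.

    Assumption: a candidate that makes the puzzle raise an exception
    is treated as a failed attempt rather than a crash.
    """
    try:
        return puzzle_fn(candidate) is True
    except Exception:
        return False


if __name__ == "__main__":
    print(verify(puzzle, 38))     # 38 * 38 = 1444, which ends in 444
    print(verify(puzzle, 12))     # 12 * 12 = 144, which does not
    print(verify(puzzle, "abc"))  # wrong type, counted as a failure
```

Because checking an answer only requires running the function, the same check can be applied to the puzzle author's own solution and to the opponent's attempt, with no human grader involved.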

Paper Structure

This paper contains 32 sections, 2 equations, 3 figures, 5 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Illustration of a reasoning duel in The Token Games. Two language models alternate between acting as proposer (creating puzzles to challenge the opponent) and as solver (attempting to find correct answers). A proposer scores in a turn if it designs a puzzle, gives a correct solution, and the opponent fails to solve the challenge. Puzzles are represented as Python functions returning a boolean value, with the challenge consisting of finding inputs that make the function return True. We can thus verify both the proposer's and the solver's solution attempts, and decide the outcome of each round and of the overall duel after a fixed number of turns.
  • Figure 2: Solve rates by turn number. After completing all duels, we mined puzzles from their logs and had GPT-5.2 and GPT-5-Mini attempt them. We show each model's solve rate for puzzles by how early in the duel they were originally created. Trend lines are computed by linear regression.
  • Figure 3: Failure modes for proposers. Did the model's own sample solution fail (red bar) or did the solver succeed in solving the puzzle (blue bar)? Each model had 90 opportunities to propose puzzles. See Table \ref{combined} for all puzzle outcomes.
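
The round rule described in the Figure 1 caption, together with the two proposer failure modes in Figure 3, can be sketched as follows. This is an illustrative sketch only: the function names, the outcome labels, and the example puzzle are hypothetical. The Elo step uses the textbook update with an assumed K-factor of 32, which may differ from the procedure the paper uses to compute ratings from pairwise duels.

```python
from typing import Any, Callable, Tuple


def accepts(puzzle_fn: Callable[[Any], bool], candidate: Any) -> bool:
    """True only if the puzzle returns True; exceptions count as failure (assumption)."""
    try:
        return puzzle_fn(candidate) is True
    except Exception:
        return False


def round_outcome(puzzle_fn: Callable[[Any], bool],
                  proposer_solution: Any,
                  solver_attempt: Any) -> str:
    """Classify one round using hypothetical outcome labels."""
    if not accepts(puzzle_fn, proposer_solution):
        return "proposer_failed"   # the proposer's own sample solution is wrong
    if accepts(puzzle_fn, solver_attempt):
        return "solver_solved"     # valid puzzle, but the opponent solved it
    return "proposer_scores"       # valid puzzle that the opponent failed to solve


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Textbook Elo update; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


if __name__ == "__main__":
    # Hypothetical puzzle: find a string of length 5 that reads the same backwards.
    palindrome = lambda s: isinstance(s, str) and len(s) == 5 and s == s[::-1]
    print(round_outcome(palindrome, "level", "hello"))
    print(round_outcome(palindrome, "level", "radar"))
    print(round_outcome(palindrome, "abcde", "radar"))
    print(elo_update(1500.0, 1500.0, 1.0))
```

Checking the proposer's solution first reflects the caption's requirement that a proposer must supply a correct solution before the solver's failure can count in its favor.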