TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

Haochuan Wang, Xiachong Feng, Lei Li, Yu Guo, Zhanyue Qin, Dianbo Sui, Lingpeng Kong

TL;DR

TMGBench introduces a comprehensive, scalable benchmark for evaluating strategic reasoning in LLMs using 144 Robinson-Goforth 2×2 game types, augmented with five story-based variants per classic game to mitigate data leakage. It supports sequential, parallel, and nested complex forms, enabling systematic analysis of multi-task and multi-layered decision-making, with Nash equilibrium inference as the evaluation core and metrics like ID, BD, and PAR to quantify reasoning quality and consistency. Across a broad set of SOTA models, TMGBench reveals high performance on classic tasks for some models but notable weaknesses in cross-context generalization, robustness, and ToM-enabled reasoning, especially under story-based framing and in more complex task forms. The findings suggest that ToM prompting can improve performance for certain models, but higher-order ToM benefits are not universally realized, highlighting the need for more resilient prompting strategies and benchmark designs to push advancements in strategic reasoning for real-world multi-agent AI systems.

Abstract

The rapid advancement of large language models has accelerated their application in reasoning, with strategic reasoning drawing increasing attention. To evaluate the strategic reasoning capabilities of LLMs, game theory, with its concise structure, has become the preferred approach for many researchers. However, current research typically focuses on a limited selection of games, resulting in low coverage of game types. Additionally, classic game scenarios carry risks of data leakage, and the benchmarks used often lack extensibility, rendering them inadequate for evaluating state-of-the-art models. To address these challenges, we propose TMGBench, characterized by comprehensive game type coverage, diverse scenarios, and flexible game organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2×2 games, constructed as classic games in our benchmark; we also synthesize diverse, higher-quality game scenarios for each classic game, which we refer to as story-based games. Lastly, to provide a sustainable evaluation framework adaptable to increasingly powerful LLMs, we treat the aforementioned games as atomic units and organize them into more complex forms through sequential, parallel, and nested structures. We conducted a comprehensive evaluation of mainstream LLMs, covering tests on rational reasoning, reasoning robustness, Theory-of-Mind capabilities, and reasoning in complex game forms. The results revealed that LLMs still have flaws in the accuracy and consistency of their strategic reasoning processes, and that their levels of mastery over Theory-of-Mind also vary. Additionally, SOTA models such as o3-mini, Qwen3, and deepseek-reasoner were evaluated across the sequential, parallel, and nested game structures, and the results highlight the challenges posed by TMGBench.
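The count of 144 game types and the Nash equilibrium task can be illustrated with a short sketch. It is not taken from the benchmark's code. It assumes that each player ranks the four outcomes of a 2×2 game strictly from 1 to 4, and that two games are the same type when one can be obtained from the other by swapping rows and/or columns. The function names (`canonical`, `pure_nash`) and the tuple layout are illustrative choices, and only pure-strategy equilibria are considered.

```python
from itertools import permutations

# A game is a pair (row_payoffs, col_payoffs). Each is a 4-tuple of ordinal
# payoffs 1..4 laid out as cells (r0c0, r0c1, r1c0, r1c1).
CELLS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def swap_rows(p):
    return (p[2], p[3], p[0], p[1])

def swap_cols(p):
    return (p[1], p[0], p[3], p[2])

def canonical(game):
    """Pick one representative of a game under row and/or column swaps."""
    a, b = game
    variants = [
        (a, b),
        (swap_rows(a), swap_rows(b)),
        (swap_cols(a), swap_cols(b)),
        (swap_cols(swap_rows(a)), swap_cols(swap_rows(b))),
    ]
    return min(variants)

def pure_nash(game):
    """Return the cells where neither player gains by deviating alone."""
    a, b = game
    result = []
    for r, c in CELLS:
        row_ok = a[2 * r + c] >= a[2 * (1 - r) + c]  # row player switches row
        col_ok = b[2 * r + c] >= b[2 * r + (1 - c)]  # column player switches column
        if row_ok and col_ok:
            result.append((r, c))
    return result

ranks = list(permutations((1, 2, 3, 4)))
game_types = {canonical((a, b)) for a in ranks for b in ranks}
print(len(game_types))  # 144

# Prisoner's Dilemma in ordinal form: action 0 = cooperate, 1 = defect.
pd = ((3, 1, 4, 2), (3, 4, 1, 2))
print(pure_nash(pd))  # [(1, 1)]
```

Under these assumptions there are 4! × 4! = 576 payoff assignments, and each type appears in exactly four relabelled copies, which gives 576 / 4 = 144. A strictly ordinal 2×2 game has zero, one, or two pure-strategy equilibria, so the answer to an equilibrium-inference task is a set of cells rather than a single cell.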

Paper Structure

This paper contains 35 sections, 1 equation, 13 figures, 6 tables.

Figures (13)

  • Figure 1: A concept map of TMGBench. Data preparation for the benchmark involves 3 ingredients: the Robinson-Goforth topology, game structure, and contextual framing. Evaluation uses several prompting methods (including ToM prompting) to elicit the strategic reasoning process of LLMs.
  • Figure 2: We design several complex forms of strategic reasoning tasks using TMGBench, which include: (1) sequential form, where LLMs are required to respond to multiple game tasks in a row, with the history of previous tasks; (2) parallel form, where LLMs are required to respond to multiple game tasks simultaneously; (3) nested form, where LLMs are required to respond to a set of interlinked game tasks (in our settings, we refer to them as the pre-game and the core-game). Games in the complex forms can be selected with different game structures and various contexts.
  • Figure 3: Demonstration of the inconsistency heat map. Each of the grids is divided into 4 quarter-grids, indicating the 4 situations. By subtracting the standard map from the practical map element-wise, we get the inconsistency map, where blue colours indicate a positive difference and red colours indicate a negative difference. The deeper the colour, the larger the difference between the LLM's response and the standard answer.
  • Figure 4: Axisymmetry in heat maps can be illustrated by the left sub-figure, where the standard heat map exhibits perfect axisymmetry across the counter-diagonal. In contrast, LLMs' responses tend to demonstrate quasi-axisymmetry, as shown by the right sub-figure. Certain pairs of positions fail to align precisely when reflected across the axis and may exhibit discrepancies, deviating from the ideal symmetric pattern.
  • Figure 5: Radar charts of the 9 sub-metrics of LLMs' performance, comparing DA prompting (left side) and CoT prompting (right side). $\mathrm{AntiID}$ and $\mathrm{AntiBD}$ are derived from $\mathrm{ID}$ and $\mathrm{BD}$ so that higher values indicate better performance (for consistency with $\mathrm{PAR}$).
  • ...and 8 more figures
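
The two heat-map operations described in the captions of Figures 3 and 4 can be sketched on plain square matrices. This is a simplified illustration, not the paper's implementation: it ignores the quarter-grid structure inside each cell, and the function names and the toy 2×2 matrices are hypothetical.

```python
def inconsistency_map(practical, standard):
    """Element-wise difference: practical map minus standard map."""
    return [
        [p - s for p, s in zip(p_row, s_row)]
        for p_row, s_row in zip(practical, standard)
    ]

def reflect_counter_diagonal(m):
    """Reflect a square matrix across its counter-diagonal."""
    n = len(m)
    return [[m[n - 1 - j][n - 1 - i] for j in range(n)] for i in range(n)]

def asymmetry(m):
    """Largest absolute gap between any cell and its mirrored counterpart."""
    mirrored = reflect_counter_diagonal(m)
    return max(
        abs(a - b)
        for row, mirrored_row in zip(m, mirrored)
        for a, b in zip(row, mirrored_row)
    )

# Toy values for illustration only (not benchmark data).
standard = [[1, 0], [0, 1]]
practical = [[1, 0], [0, 0.5]]

print(inconsistency_map(practical, standard))  # [[0, 0], [0, -0.5]]
print(asymmetry(standard))   # 0
print(asymmetry(practical))  # 0.5
```

In this sketch a map is perfectly axisymmetric when `asymmetry` returns 0. A small positive value corresponds to the quasi-axisymmetry described for LLM responses, where some mirrored pairs of positions do not align exactly.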