GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents

Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, Arjun Yadav

TL;DR

GameBench introduces a cross-domain benchmark to quantify strategic reasoning in LLM agents using nine diverse games designed to be out-of-distribution with respect to pretraining data. It evaluates GPT-3.5-turbo and GPT-4 base models, with Chain-of-Thought (CoT) and Reasoning Via Planning (RAP) scaffolds, against random and human baselines, and analyzes performance through an exponential Bradley–Terry rating framework with bootstrapping. Key findings show that neither model matches human performance; CoT markedly improves GPT-4’s strategic reasoning while RAP yields mixed results and often lags behind CoT, highlighting limitations in current LLMs on complex multi-agent tasks. The work demonstrates both the promise and the limits of scaffolding for strategic reasoning and provides a framework for future out-of-distribution evaluation and scaffold development.

Abstract

Large language models have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of large language models in complex, strategic scenarios, there is no comprehensive framework for evaluating agents' performance across the various types of reasoning found in games. To address this gap, we introduce GameBench, a cross-domain benchmark for evaluating the strategic reasoning abilities of LLM agents. We focus on 9 different game environments, each covering at least one axis of the key reasoning skills identified in strategy games, and select games for which strategy explanations are unlikely to form a significant portion of models' pretraining corpora. Our evaluations use GPT-3 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP). Our results show that none of the tested models match human performance, and at worst GPT-4 performs worse than random action. CoT and RAP both improve scores, but not to levels comparable with human performance.

Paper Structure (22 sections, 3 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Rating data. With CoT scaffolding, GPT-4 is the strongest agent after the human baseline, achieving the best LLM performance on Sea Battle and Pit. Without CoT, it rates below even the random baseline because of its exceedingly low rating on Sea Battle. The state-of-the-art RAP scaffolding does not improve GPT-4 as much as CoT does. The top line of Figure \ref{fig:rating_scatter} reveals the best agent in each game. Ratings come from the exponential Bradley–Terry model; see Section \ref{rating_model} for details. The whiskers represent 90% CIs computed from the bootstrapping process formalized in Section \ref{rating_model}. ALS = Air, Land, Sea; ARC = Arctic Scavengers; AYT = Are You the Traitor?; CN = Codenames; HV = Hive; PT = Pit; SN = Santorini; TRB = Two Rooms and a Boom; SB = Sea Battle.
  • Figure 2: Number of matches recorded. The random baseline and faster games were oversampled due to their low cost.
  • ...and 5 more figures
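
The Figure 1 caption refers to ratings from an exponential Bradley–Terry model with 90% CIs obtained by bootstrapping. The paper's exact formulation lives in its rating-model section, which is not reproduced here, so the following is only a minimal sketch of the general technique under stated assumptions: the standard Bradley–Terry model in its exponential parameterization, P(i beats j) = exp(b_i) / (exp(b_i) + exp(b_j)), fit with the classic minorization-maximization updates, plus a percentile bootstrap over matches. The function names, the `prior` smoothing term, and the toy match list with agents "A", "B", "C" are all hypothetical and are not taken from the paper or its data.

```python
import math
import random
from collections import defaultdict


def fit_bradley_terry(matches, agents, prior=0.1, iters=200):
    """Fit log-strengths b with P(i beats j) = e^b_i / (e^b_i + e^b_j).

    matches: list of (winner, loser) pairs.
    prior:   assumed smoothing, a small virtual win in both directions for
             every pair, so agents with zero wins keep a finite rating.
    """
    wins = defaultdict(float)  # wins[(i, j)] = number of times i beat j
    for winner, loser in matches:
        wins[(winner, loser)] += 1.0
    for i in agents:
        for j in agents:
            if i != j:
                wins[(i, j)] += prior

    strength = {a: 1.0 for a in agents}
    for _ in range(iters):
        updated = {}
        for i in agents:
            total_wins = sum(wins[(i, j)] for j in agents if j != i)
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (strength[i] + strength[j])
                for j in agents
                if j != i
            )
            updated[i] = total_wins / denom
        # Ratings are only identified up to a shift, so pin the mean log-strength to 0.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {a: v / math.exp(log_mean) for a, v in updated.items()}
    return {a: math.log(s) for a, s in strength.items()}


def bootstrap_ci(matches, agents, n_boot=200, level=0.90, seed=0):
    """Percentile bootstrap: resample matches with replacement and refit."""
    rng = random.Random(seed)
    samples = {a: [] for a in agents}
    for _ in range(n_boot):
        resample = [rng.choice(matches) for _ in matches]
        fitted = fit_bradley_terry(resample, agents)
        for a in agents:
            samples[a].append(fitted[a])
    tail = (1.0 - level) / 2.0
    intervals = {}
    for a, values in samples.items():
        values.sort()
        low = values[int(tail * (n_boot - 1))]
        high = values[int((1.0 - tail) * (n_boot - 1))]
        intervals[a] = (low, high)
    return intervals


# Toy, made-up match records (winner, loser); not results from the paper.
agents = ["A", "B", "C"]
matches = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 8 + [("C", "B")] * 2
    + [("A", "C")] * 9 + [("C", "A")] * 1
)
ratings = fit_bradley_terry(matches, agents)
intervals = bootstrap_ci(matches, agents)
for name in agents:
    print(name, round(ratings[name], 3), intervals[name])
```

Resampling whole matches keeps each bootstrap replicate the same size as the original record, and the interval width then reflects how many matches each agent actually played. Whether the paper resamples per game, per agent pair, or over all matches is not stated in this section.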