GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps

Muhammad Umair Nasir, Steven James, Julian Togelius

TL;DR

This work investigates the planning capabilities of large language models by proposing GameTraversalBenchmark (GTB), a benchmark of diverse 2D grid-based game maps. Among the LLMs evaluated, GPT-4-Turbo achieves the highest score, 44.97% on the composite GTB_Score (GTBS), and a preliminary test of the reasoning model o1 reaches 67.84%, indicating that the benchmark remains challenging for current models.

Abstract

Large language models (LLMs) have recently demonstrated great success in generating and understanding natural language. While they have also shown potential beyond the domain of natural language, it remains an open question to what extent and in which ways these LLMs can plan. We investigate their planning capabilities by proposing GameTraversalBenchmark (GTB), a benchmark consisting of diverse 2D grid-based game maps. An LLM succeeds if it can traverse through the given objectives with a minimum number of steps and a minimum number of generation errors. We evaluate a number of LLMs on GTB and find that GPT-4-Turbo achieves the highest score, 44.97% on GTB_Score (GTBS), a composite score that combines these three criteria. Furthermore, we preliminarily test large reasoning models, namely o1, which scores 67.84% on GTBS, indicating that the benchmark remains challenging for current models. Code, data, and documentation are available at https://github.com/umair-nasir14/Game-Traversal-Benchmark.

Paper Structure

This paper contains 4 sections, 6 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Top: An example of a level produced by the Word2World algorithm (Nasir et al., 2024) and rendered using its tileset. Middle: Binary mask of the above map for visualising walkable tiles (teal), non-walkable tiles (purple), and the position of objectives (numbered). Bottom: Character representation of the map provided as input to the LLM.
  • Figure 2: An illustration of the GTB evaluation loop. Each game map $M$ is evaluated turn-by-turn for all $N$ objectives present in it. A game state $S$ includes the game map, the position of the LLM agent, the positions of the objectives, and the current rewards. The updated state, $S + 1$, contains the updated position of the LLM agent and the updated rewards. The LLM agent produces a sequence of actions for each objective and is evaluated on that objective. Once all objectives have been iterated over, the agent's evaluation is stored and the loop moves to the next map. Once all maps are evaluated, the GTB metrics are calculated.
  • Figure 3: An example of an input to the LLM and the resulting action sequence for the objective.
  • Figure 4: Example of a prompt in GTB to generate actions.
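
The evaluation loop described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the action names, the `#` wall character, the map dictionary layout, the rule that a blocked move still counts as a step, and the `scripted_agent` stand-in for an LLM are all hypothetical. The sketch only records per-objective outcomes (objective reached, steps taken, generation errors) and does not attempt to reproduce the GTBS formula, which is not given here.

```python
from typing import Callable, Dict, List, Tuple

Pos = Tuple[int, int]  # (row, col)

# Hypothetical action vocabulary; the benchmark's real action names may differ.
MOVES: Dict[str, Pos] = {
    "move_up": (-1, 0),
    "move_down": (1, 0),
    "move_left": (0, -1),
    "move_right": (0, 1),
}
WALL = "#"  # assumed character for a non-walkable tile


def run_objective(grid: List[str], start: Pos, goal: Pos, actions: List[str]):
    """Apply an action sequence; return (final_pos, steps, errors, reached)."""
    pos, steps, errors = start, 0, 0
    for action in actions:
        if action not in MOVES:
            errors += 1  # unparseable output is counted as a generation error
            continue
        dr, dc = MOVES[action]
        r, c = pos[0] + dr, pos[1] + dc
        steps += 1  # assumption: a blocked move still costs a step
        inside = 0 <= r < len(grid) and 0 <= c < len(grid[r])
        if inside and grid[r][c] != WALL:
            pos = (r, c)
        if pos == goal:
            return pos, steps, errors, True
    return pos, steps, errors, False


def evaluate(maps: List[dict], agent: Callable[[List[str], Pos, Pos], List[str]]):
    """Iterate over maps, then over each map's objectives in order."""
    results = []
    for m in maps:
        pos = m["start"]
        per_objective = []
        for goal in m["objectives"]:
            actions = agent(m["grid"], pos, goal)  # an LLM call in the real setting
            pos, steps, errors, reached = run_objective(m["grid"], pos, goal, actions)
            per_objective.append({"reached": reached, "steps": steps, "errors": errors})
        results.append(per_objective)
    return results


# Toy map and a scripted stand-in for the LLM agent (both hypothetical).
DEMO_MAP = {
    "grid": ["....", ".##.", "...."],
    "start": (0, 0),
    "objectives": [(0, 3), (2, 3)],
}


def scripted_agent(grid: List[str], pos: Pos, goal: Pos) -> List[str]:
    if goal == (0, 3):
        return ["move_right"] * 3
    return ["jump", "move_down", "move_down"]  # "jump" is deliberately invalid


if __name__ == "__main__":
    print(evaluate([DEMO_MAP], scripted_agent))
```

The agent's position carries over from one objective to the next, matching the caption's description of a state that is updated turn by turn. Aggregating the per-objective records into a single score is left out on purpose.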