PLUGH: A Benchmark for Spatial Understanding and Reasoning in Large Language Models
Alexey Tikhonov
TL;DR
PLUGH is a benchmark for spatial reasoning in LLMs derived from text-based games; it links narrative transcripts to formal spatial graphs across five tasks. The dataset comprises 125 segments from 48 games, yielding 61 non-isomorphic graphs. The graphs are extracted via the Jericho emulator, and the transcripts are rewritten into fiction to bridge narrative and structure. Evaluation across commercial and open-source LLMs shows the GPT-4 family leading on reasoning tasks, while open-source models are competitive in some settings; all models still exhibit notable errors, such as formatting issues and location hallucinations. The authors release code and data, analyze error modes, and discuss principles and future directions for improving grounding and spatial reasoning in LLMs.
Abstract
We present PLUGH (https://www.urbandictionary.com/define.php?term=plugh), a modern benchmark for assessing the spatial understanding and reasoning abilities of Large Language Models (LLMs). It currently consists of 5 tasks, each with 125 input texts extracted from 48 different games and representing 61 different (non-isomorphic) spatial graphs. Our evaluation of API-based and open-source LLMs shows that while some commercial LLMs exhibit strong reasoning abilities, open-source competitors can demonstrate almost the same level of quality; however, all models still have significant room for improvement. We identify typical reasons for LLM failures and discuss possible ways to deal with them. Datasets and evaluation code are released (https://github.com/altsoph/PLUGH).
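To make the phrase "non-isomorphic spatial graphs" concrete, the following is a minimal sketch of how one might count structurally distinct location graphs in a small collection. It is not the released PLUGH code: the function names, the toy room names, and the choice to treat connections as undirected edges are all assumptions made for illustration. The brute-force check tries every node relabeling, so it is only practical for graphs with a handful of locations.

```python
from itertools import permutations


def normalize(edges):
    """Represent an undirected graph as a frozenset of frozenset node pairs."""
    return frozenset(frozenset(edge) for edge in edges)


def are_isomorphic(edges_a, edges_b):
    """Brute-force isomorphism test for small undirected graphs.

    Assumption: every location appears in at least one edge, so the node
    set can be recovered from the edge list.
    """
    a, b = normalize(edges_a), normalize(edges_b)
    nodes_a = sorted({node for edge in a for node in edge})
    nodes_b = sorted({node for edge in b for node in edge})
    # Cheap rejections before trying relabelings.
    if len(nodes_a) != len(nodes_b) or len(a) != len(b):
        return False
    # Try every bijection from nodes of A onto nodes of B.
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        relabeled = frozenset(
            frozenset(mapping[node] for node in edge) for edge in a
        )
        if relabeled == b:
            return True
    return False


def count_non_isomorphic(graphs):
    """Count structurally distinct graphs, ignoring location names."""
    representatives = []
    for graph in graphs:
        if not any(are_isomorphic(graph, rep) for rep in representatives):
            representatives.append(graph)
    return len(representatives)


# Hypothetical toy maps: two three-room corridors and one three-room loop.
corridor_1 = [("Kitchen", "Hall"), ("Hall", "Cellar")]
corridor_2 = [("Cave", "Tunnel"), ("Tunnel", "Lake")]
loop = [("Attic", "Stairs"), ("Stairs", "Porch"), ("Porch", "Attic")]

distinct = count_non_isomorphic([corridor_1, corridor_2, loop])
```

Here the two corridors differ only in their room names, so they count as one structure, while the loop counts as a second. This is the sense in which 125 input texts can correspond to fewer distinct graphs.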
