PLUGH: A Benchmark for Spatial Understanding and Reasoning in Large Language Models

Alexey Tikhonov

TL;DR

PLUGH introduces a benchmark for spatial reasoning in LLMs that is derived from text-based games and links narrative transcripts to formal spatial graphs across five tasks. The dataset comprises 125 segments from 48 games, yielding 61 non-isomorphic graphs; the graphs are extracted via the Jericho emulator, and the transcripts are rewritten as fiction to bridge narrative and structure. Evaluation across commercial and open-source LLMs shows the GPT-4 family leading on reasoning tasks, while open-source models are competitive in some settings. All models still make notable errors, such as formatting issues and location hallucinations. The authors release code and data, analyze error modes, and discuss principles and future directions for improving grounding and spatial reasoning in LLMs.

Abstract

We present PLUGH (https://www.urbandictionary.com/define.php?term=plugh), a modern benchmark that currently consists of 5 tasks, each with 125 input texts extracted from 48 different games and representing 61 different (non-isomorphic) spatial graphs to assess the abilities of Large Language Models (LLMs) for spatial understanding and reasoning. Our evaluation of API-based and open-sourced LLMs shows that while some commercial LLMs exhibit strong reasoning abilities, open-sourced competitors can demonstrate almost the same level of quality; however, all models still have significant room for improvement. We identify typical reasons for LLM failures and discuss possible ways to deal with them. Datasets and evaluation code are released (https://github.com/altsoph/PLUGH).
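To make the phrase "non-isomorphic spatial graphs" concrete, here is a minimal Python sketch, not taken from the released code. It assumes each spatial graph is given as a list of undirected location pairs; the names `are_isomorphic` and `count_non_isomorphic`, and the example locations, are hypothetical. Two graphs count as the same when some renaming of locations maps one edge set exactly onto the other. The brute-force search over renamings is only practical for small graphs.

```python
from itertools import permutations


def normalize(edges):
    """Turn an iterable of (a, b) location pairs into a set of undirected edges."""
    return {frozenset(edge) for edge in edges}


def are_isomorphic(edges_a, edges_b):
    """Brute-force isomorphism test for small undirected graphs.

    Assumption: every location appears in at least one edge
    (isolated locations are not represented).
    """
    ea, eb = normalize(edges_a), normalize(edges_b)
    nodes_a = sorted({node for edge in ea for node in edge})
    nodes_b = sorted({node for edge in eb for node in edge})
    # Cheap checks first: node and edge counts must match.
    if len(nodes_a) != len(nodes_b) or len(ea) != len(eb):
        return False
    # Try every renaming of graph A's locations onto graph B's locations.
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        renamed = {frozenset(mapping[node] for node in edge) for edge in ea}
        if renamed == eb:
            return True
    return False


def count_non_isomorphic(graphs):
    """Count graphs up to isomorphism by keeping one representative per class."""
    representatives = []
    for graph in graphs:
        if not any(are_isomorphic(graph, rep) for rep in representatives):
            representatives.append(graph)
    return len(representatives)


# Hypothetical examples: two three-room corridors and one triangular layout.
cellar_path = [("Kitchen", "Hall"), ("Hall", "Cellar")]
garden_path = [("Gate", "Garden"), ("Garden", "Shed")]
triangle = [("A", "B"), ("B", "C"), ("C", "A")]

distinct = count_non_isomorphic([cellar_path, garden_path, triangle])
print(distinct)
```

In this sketch the two corridors collapse into one class because only the room names differ, while the triangle has a different shape. This is the sense in which 125 text segments can correspond to fewer distinct graphs.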

Paper Structure (16 sections, 8 figures, 6 tables)

This paper contains 16 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: The principal schema of our approach.
  • Figure 2: An example of a good segment from the Asgard game that passed all filters.
  • Figure 3: The distribution of path lengths across the graphs in tasks 2a and 2b.
  • Figure 4: The distribution of path lengths across the graphs in Task 4.
  • Figure 5: A spatial graph reconstructed from a text of the Winnie-the-Pooh book.
  • ...and 3 more figures