Evaluating Spatial Understanding of Large Language Models

Yutaro Yamada; Yihan Bao; Andrew K. Lampinen; Jungo Kasai; Ilker Yildirim

TL;DR

This work probes whether text-only large language models implicitly possess spatial knowledge, using sequential natural-language navigation tasks across diverse topologies (square, hexagon, triangle, ring, tree). It systematically compares GPT-3.5-turbo, GPT-4, and multiple Llama/CodeLlama variants under zero-shot conditions, analyzing accuracy, the order in which the map is presented, and local versus global map construction. Key findings show structure-dependent performance: square-like layouts are easiest, and certain topologies elicit distinct error biases (spatial vs. temporal). Local presentations generally yield higher accuracy than global ones, and input encoding can shape internal representations. The results indicate that LLMs capture some aspects of spatial structure, but substantial room for improvement remains and more robust grounding methods are needed.

Abstract

Large language models (LLMs) show remarkable capabilities across a variety of tasks. Despite the models only seeing text in training, several recent studies suggest that LLM representations implicitly capture aspects of the underlying grounded concepts. Here, we explore LLM representations of a particularly salient kind of grounded knowledge -- spatial relationships. We design natural-language navigation tasks and evaluate the ability of LLMs, in particular GPT-3.5-turbo, GPT-4, and Llama2 series models, to represent and reason about spatial structures. These tasks reveal substantial variability in LLM performance across different spatial structures, including square, hexagonal, and triangular grids, rings, and trees. In extensive error analysis, we find that LLMs' mistakes reflect both spatial and non-spatial factors. These findings suggest that LLMs appear to capture certain aspects of spatial structure implicitly, but room for improvement remains.
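To make the task format concrete, the following is a minimal sketch of a square-grid navigation instance, not the authors' actual prompt or code. The move vocabulary, the object names, the helper names (`follow`, `build_prompt`, `answer`), and the prompt wording are all hypothetical and chosen only for illustration.

```python
# Hypothetical move vocabulary for a square grid, as (dx, dy) offsets.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def follow(start, moves, size):
    """Return the list of visited nodes, starting at `start`, on a size x size grid."""
    x, y = start
    visited = [(x, y)]
    for move in moves:
        dx, dy = MOVES[move]
        x, y = x + dx, y + dy
        if not (0 <= x < size and 0 <= y < size):
            raise ValueError(f"move {move!r} leaves the grid")
        visited.append((x, y))
    return visited


def build_prompt(size, moves, objects):
    """Describe the walk in natural language (illustrative wording only)."""
    visited = follow((0, 0), moves, size)
    lines = [
        f"You are on a {size} by {size} square grid, at the bottom-left corner, "
        f"where you find {objects[visited[0]]}."
    ]
    for move, node in zip(moves, visited[1:]):
        lines.append(f"You move {move} by one step, where you find {objects[node]}.")
    return " ".join(lines)


def answer(start, moves, size, objects):
    """Ground-truth object at the node reached after following `moves`."""
    return objects[follow(start, moves, size)[-1]]


# Illustrative instance: object names are made up for this sketch.
objects = {(0, 0): "a lamp", (1, 0): "a turtle", (1, 1): "a drum", (0, 1): "a kite"}
print(build_prompt(3, ["right", "up", "left"], objects))
print(answer((0, 0), ["right", "up", "left", "down"], 3, objects))
```

The final query returns to an already-visited node, so answering it correctly requires tracking position across moves rather than recalling the most recent object.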

Paper Structure (22 sections, 23 figures, 6 tables)


Introduction
Spatial Understanding Task
Models and evaluation metrics
Results
Do different spatial structure features affect model performance?
Is building a local map more difficult than building a full map and retrieving a path?
The order of presenting the map impacts spatial understanding
Relational structure: Tree
Grid size inference from sequences of navigational instructions
Error analysis
Comparing error distributions of square, triangular, and hexagonal grids
Comparison to human baseline
Related Work
Conclusion
Additional prompt examples
...and 7 more sections

Figures (23)

Figure 1: The spatial structures we examine for the underlying maps include squares, triangles, hexagons and rings. Additionally, we analyze a tree structure to explore its relational nature.
Figure 2: Example question and its answer for the square, triangle, hexagon, and ring structures.
Figure 3: We compare the accuracy of the models across the different spatial structures. The random-guessing accuracy is 1/8 because a random guess is drawn uniformly from the nodes encountered by the model, which correspond to the local path with 8 navigation steps. GPT-4 has higher prediction accuracy than random guessing on the square, ring, and triangle structures, but lower on the hexagon. ChatGPT exhibits lower prediction accuracy than random guessing across all of these structures. Llama2-70B and CodeLlama-34B show a similar pattern to GPT-4. The error bars indicate 1.96 times the standard error across 5 runs.
Figure 4: Example question and its answer for the square and ring structures under the global setting.
Figure 5: Performance is evaluated on GPT-4, Llama2-70B, and CodeLlama-34B. For both square and ring structures, the prediction accuracy of GPT-4 is higher with the local map than with the global map. Llama2-70B and CodeLlama-34B show a similar pattern for the square, while the pattern is less clear for the ring. The error bars indicate 1.96 times the standard error across 5 runs.
...and 18 more figures
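
The error bars and the 1/8 random-guessing baseline mentioned in the captions for Figures 3 and 5 can be sketched as below. This is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, the per-run accuracies are made-up placeholders, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the captions do not say which estimator was used.

```python
import math
import statistics

# Random-guessing baseline from the Figure 3 caption: a uniform choice
# among the nodes on a local path with 8 navigation steps.
RANDOM_BASELINE = 1 / 8


def mean_with_error_bar(accuracies, z=1.96):
    """Return (mean, half_width), where half_width = z * standard error.

    Assumes the sample standard deviation (n - 1 denominator).
    """
    n = len(accuracies)
    mean = statistics.fmean(accuracies)
    standard_error = statistics.stdev(accuracies) / math.sqrt(n)
    return mean, z * standard_error


def beats_baseline(accuracies, baseline=RANDOM_BASELINE):
    """True if the lower end of the error bar lies above the baseline."""
    mean, half_width = mean_with_error_bar(accuracies)
    return mean - half_width > baseline


# Placeholder accuracies for 5 runs (not results from the paper).
runs = [0.2, 0.3, 0.4, 0.5, 0.6]
mean, half_width = mean_with_error_bar(runs)
print(f"accuracy = {mean:.3f} +/- {half_width:.3f}")
print("above random guessing:", beats_baseline(runs))
```

With only 5 runs, a multiplier of 1.96 gives a normal-approximation interval; a t-based multiplier would be wider, which is worth keeping in mind when reading the plots.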
