SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models

Md Imbesat Hassan Rizvi; Xiaodan Zhu; Iryna Gurevych

TL;DR

This work systematically evaluates spatial reasoning in state-of-the-art LLMs using SpaRC, a property-driven framework, and SpaRP, a dataset of deductively verified reasoning paths. SpaRC defines six key properties (PO/EO, RI/RC, QS/QU) to analyze spatial contexts and composition rules, while SpaRP generates ground-truth reasoning traces for training and evaluation. The study shows that LLMs struggle with spatial reasoning, though performance improves with model size and especially with finetuning on reasoning paths; GPT-4 remains strongest overall, with proprietary models outperforming open-source ones in topological tasks. The contributions enable deeper analysis of spatial reasoning capabilities, provide ground-truth reasoning paths for explainability, and offer practical guidance for building more reliable spatially aware AI agents.

Abstract

Spatial reasoning is a crucial component of both biological and artificial intelligence. In this work, we present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning. To support our study, we create and contribute a novel Spatial Reasoning Characterization (SpaRC) framework and the Spatial Reasoning Paths (SpaRP) datasets, which enable an in-depth understanding of spatial relations and compositions as well as the usefulness of spatial reasoning chains. We find that none of the state-of-the-art LLMs performs well on the datasets -- their performance is consistently low across different setups. Spatial reasoning capability improves substantially as model size scales up. Finetuning both large language models (e.g., Llama-2-70B) and smaller ones (e.g., Llama-2-13B) can significantly improve their F1-scores by 7--32 absolute points. We also find that the top proprietary LLMs still significantly outperform their open-source counterparts in topological spatial understanding and reasoning.

Paper Structure (37 sections, 5 figures, 15 tables, 2 algorithms)

This paper contains 37 sections, 5 figures, 15 tables, 2 algorithms.

Introduction
Related Work
Text-based Spatial Reasoning.
Reasoning Abilities of Large Language Models.
The Spatial Reasoning Characterization (SpaRC) Framework
Principle and Design of SpaRC
Fixed Orientation or Point of View (FPoV).
Point Objects (PO).
Extended Objects (EO).
Relation Incomplete (RI).
Relation Complete (RC).
Quantitatively Specified (QS).
Quantitatively Unspecified (QU).
Creation of the SpaRC Dataset
The Spatial Reasoning Paths (SpaRP)
...and 22 more sections

Figures (5)

Figure 1: Visualization of Relation Complete (RC) and Relation Incomplete (RI) contexts for the RIGHT relation for Point Objects (PO) and Extended Objects (EO).
Figure 2: Our step-by-step deductive Spatial Reasoning Paths (SpaRP) generation. A context graph and node traversal from the head to the tail entity in a question is identified and verbalized. Blue indicates context relations $r^c$, red indicates inverse context relations $r^{ic}$, and green indicates deduced relations $r^d$ between entities while traversing the reasoning path A--B--C--D--E.
Figure 3: F1 scores vs. ground truth number of hops for spatial reasoning across the datasets and models. SC=20 means self-consistency over 20 generations, and FT indicates finetuned model with greedy decoding.
Figure 4: F1 scores of individual labels across the datasets and models. SC=20 means self-consistency over 20 generations, and FT indicates finetuned model with greedy decoding.
Figure 5: An example reproduced from StepGame (Shi et al., 2022).
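
The path generation summarized in the Figure 2 caption (build a context graph, traverse it from the head to the tail entity, and track context, inverse, and deduced relations along the way) can be illustrated with a minimal sketch. This is not the authors' implementation. The relation vocabulary, the sign-based composition rule, and all function names below are assumptions made for illustration: each relation is modeled as a pair of axis signs for point objects, "left" is assumed to mean exactly horizontally aligned, and composing opposite signs is treated as undetermined because no distances are given.

```python
from collections import deque

# Hypothetical relation vocabulary (an assumption, not the paper's label set):
# each relation is the (dx, dy) sign of the first entity relative to the second.
SIGNS = {"left": (-1, 0), "right": (1, 0), "above": (0, 1), "below": (0, -1)}
NAMES = {v: k for k, v in SIGNS.items()}
NAMES.update({(-1, 1): "upper-left", (1, 1): "upper-right",
              (-1, -1): "lower-left", (1, -1): "lower-right"})


def inverse(sign):
    """Inverse context relation: swap the roles of the two entities."""
    return (-sign[0], -sign[1])


def compose_axis(a, b):
    """Compose two signs on one axis; opposite signs are undetermined."""
    if a == 0:
        return b
    if b == 0 or a == b:
        return a
    return None


def compose(r1, r2):
    """Deduce the head-to-next relation from two consecutive relations."""
    if r1 is None or r2 is None:
        return None
    x, y = compose_axis(r1[0], r2[0]), compose_axis(r1[1], r2[1])
    return None if x is None or y is None else (x, y)


def build_graph(context):
    """Context graph with one context edge and one inverse edge per triple."""
    graph = {}
    for head, rel, tail in context:
        sign = SIGNS[rel]
        graph.setdefault(head, []).append((tail, sign, "context"))
        graph.setdefault(tail, []).append((head, inverse(sign), "inverse"))
    return graph


def reasoning_path(context, head, tail):
    """Breadth-first traversal from head to tail, then step-wise composition."""
    graph = build_graph(context)
    parent = {head: None}
    queue = deque([head])
    while queue:
        node = queue.popleft()
        if node == tail:
            break
        for nxt, sign, kind in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = (node, sign, kind)
                queue.append(nxt)
    if tail not in parent:
        return None  # tail entity is not connected to the head entity
    hops, node = [], tail
    while parent[node] is not None:
        prev, sign, kind = parent[node]
        hops.append((prev, sign, node, kind))
        node = prev
    hops.reverse()
    steps, deduced = [], None
    for i, (a, sign, b, kind) in enumerate(hops):
        deduced = sign if i == 0 else compose(deduced, sign)
        label = NAMES[deduced] if deduced is not None else "unknown"
        # (hop start, hop end, edge kind, hop relation, head-to-hop-end relation)
        steps.append((a, b, kind, NAMES[sign], label))
    return steps


# Each triple reads "first entity is <relation> of second entity".
context = [("A", "left", "B"), ("C", "below", "B"),
           ("C", "left", "D"), ("E", "right", "D")]
for step in reasoning_path(context, "A", "E"):
    print(step)
```

In this toy context the traversal follows A--B--C--D--E, alternating between stated context relations and their inverses, and the last tuple carries the deduced relation between A and E. A breadth-first search is used here only because it returns a shortest chain of hops; the source does not state which traversal the authors use.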
