Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning

Fangjun Li; David C. Hogg; Anthony G. Cohn

TL;DR

The paper addresses a gap in the evaluation of qualitative spatial reasoning (QSR) by introducing RoomSpace, a benchmark built on realistic 3D simulation data derived from ProcTHOR that covers diverse room layouts and object relations. It casts spatial reasoning as a constraint satisfaction problem (CSP), generates data from a configurable tuple $\langle n, d, m, p\rangle$, and uses a logic-to-text framework to produce narrative stories and questions. A consistency-checking tool lets the evaluation accept multiple plausible solutions. Key contributions are an analysis of existing QSR benchmarks, the CSP-based data generation pipeline, the logic-to-text description system, and an empirical evaluation showing that GPT-4 generally outperforms the other models tested but struggles with multi-hop spatial reasoning and with mixed view descriptions. The findings advance LM evaluation for spatial reasoning and point to directions for improving model abilities in real-world qualitative spatial tasks, with implications for multi-modal and narrative-driven reasoning in AI systems.

Abstract

Spatial reasoning plays a vital role in both human cognition and machine intelligence, prompting new research into language models' (LMs) capabilities in this regard. However, existing benchmarks reveal shortcomings in evaluating qualitative spatial reasoning (QSR). These benchmarks typically present oversimplified scenarios or unclear natural language descriptions, hindering effective evaluation. We present a novel benchmark for assessing QSR in LMs, which is grounded in realistic 3D simulation data, offering a series of diverse room layouts with various objects and their spatial relationships. This approach provides a more detailed and context-rich narrative for spatial reasoning evaluation, diverging from traditional, toy-task-oriented scenarios. Our benchmark encompasses a broad spectrum of qualitative spatial relationships, including topological, directional, and distance relations. These are presented with different viewing points, varied granularities, and density of relation constraints to mimic real-world complexities. A key contribution is our logic-based consistency-checking tool, which enables the assessment of multiple plausible solutions, aligning with real-world scenarios where spatial relationships are often open to interpretation. Our benchmark evaluation of advanced LMs reveals their strengths and limitations in spatial reasoning. They face difficulties with multi-hop spatial reasoning and interpreting a mix of different view descriptions, pointing to areas for future improvement.
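The idea of accepting any answer that is consistent with the stated constraints, rather than matching a single gold label, can be illustrated with a small brute-force check. This is a minimal sketch and not the authors' tool: the relation names, the grid model of a room, and the helper names `is_consistent` and `plausible_answers` are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical relation vocabulary (an assumption, not the benchmark's own).
# Each predicate compares two (x, y) grid positions.
RELATIONS = {
    "left_of": lambda a, b: a[0] < b[0],
    "right_of": lambda a, b: a[0] > b[0],
    "behind": lambda a, b: a[1] > b[1],
    "in_front_of": lambda a, b: a[1] < b[1],
}


def is_consistent(constraints):
    """Return True if some grid placement satisfies every (a, relation, b) triple."""
    objects = sorted({o for a, _, b in constraints for o in (a, b)})
    # n objects need at most n distinct coordinates per axis for strict orderings.
    size = len(objects)
    cells = list(product(range(size), repeat=2))
    # Exhaustive search: exponential in the number of objects, fine for a toy case.
    for placement in product(cells, repeat=len(objects)):
        pos = dict(zip(objects, placement))
        if all(RELATIONS[r](pos[a], pos[b]) for a, r, b in constraints):
            return True
    return False


def plausible_answers(constraints, a, b):
    """List every relation between a and b that does not contradict the story."""
    return [r for r in RELATIONS if is_consistent(constraints + [(a, r, b)])]


story = [("chair", "left_of", "table"), ("table", "left_of", "lamp")]
answers = plausible_answers(story, "chair", "lamp")
print(answers)
```

In this toy story the chair must be left of the lamp, but nothing fixes their front-to-back order, so more than one relation survives the check. That is the sense in which several answers can be plausible at once.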

Paper Structure (34 sections, 14 figures, 2 tables)

Introduction
Analysis of Existing QSR in Text Datasets/Benchmarks
bAbI
StepGame
SpartQA, SpaRTUN
Data Generation Framework
Problem Definition
Data Generation Process
Define House Scenes and Objects
Specify Spatial Relationships
Object Layout within Room
Directional Relations
Topological Relations
Relations between Objects
Directional Relations
...and 19 more sections

Figures (14)

Figure 1: One test instance in our benchmark, consisting only of text for evaluating LMs. The accompanying images are for visualization but could be used to test multi-modal LLMs.
Figure 2: Examples of Task 17 and Task 19 from the en-valid-10k version of the bAbI dataset.
Figure 3: Illustration of directional spatial relationships and test instance constraint chain building process in StepGame.
Figure 4: A test example in SpaRTUN.
Figure 5: Sample scenes from our dataset showcasing four types of rooms in a top-down view.
...and 9 more figures
