Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning
Fangjun Li, David C. Hogg, Anthony G. Cohn
TL;DR
The paper addresses a gap in the evaluation of qualitative spatial reasoning (QSR) by introducing RoomSpace, a benchmark built on realistic 3D simulation data derived from ProcTHOR that presents diverse room layouts and object relations. It casts spatial reasoning as a constraint satisfaction problem (CSP), generates data via a configurable tuple $\langle n, d, m, p\rangle$, uses a logic-to-text framework to produce narrative stories and questions, and employs a consistency-checking tool to accommodate multiple plausible solutions. Key contributions include a comprehensive analysis of existing QSR benchmarks, the CSP-based data generation pipeline, the logic-to-text description system, and an empirical evaluation showing that GPT-4 generally outperforms other models but struggles with multi-hop spatial reasoning and with descriptions that mix different views. The findings advance LM evaluation for spatial reasoning and point to directions for improving model abilities in real-world qualitative spatial tasks, with implications for multi-modal and narrative-driven reasoning in AI systems.
Abstract
Spatial reasoning plays a vital role in both human cognition and machine intelligence, prompting new research into language models' (LMs) capabilities in this regard. However, existing benchmarks reveal shortcomings in evaluating qualitative spatial reasoning (QSR). These benchmarks typically present oversimplified scenarios or unclear natural language descriptions, hindering effective evaluation. We present a novel benchmark for assessing QSR in LMs, which is grounded in realistic 3D simulation data, offering a series of diverse room layouts with various objects and their spatial relationships. This approach provides a more detailed and context-rich narrative for spatial reasoning evaluation, diverging from traditional, toy-task-oriented scenarios. Our benchmark encompasses a broad spectrum of qualitative spatial relationships, including topological, directional, and distance relations. These are presented from different viewpoints, at varied granularities, and with varying densities of relation constraints to mimic real-world complexities. A key contribution is our logic-based consistency-checking tool, which enables the assessment of multiple plausible solutions, aligning with real-world scenarios where spatial relationships are often open to interpretation. Our benchmark evaluation of advanced LMs reveals their strengths and limitations in spatial reasoning. In particular, the models have difficulty with multi-hop spatial reasoning and with interpreting a mix of different view descriptions, pointing to areas for future improvement.
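
To make the consistency-checking idea concrete, the following is a minimal sketch and not the paper's tool. It assumes a hypothetical vocabulary of four directional relations interpreted as strict coordinate comparisons on a small grid, and it uses brute-force search in place of a proper qualitative calculus reasoner. The names `RELATIONS`, `is_consistent`, and `plausible_answers` are illustrative only. An answer counts as plausible if adding it to the stated facts still leaves at least one satisfying placement.

```python
from itertools import product

# Hypothetical relation vocabulary (an assumption, not the benchmark's own):
# each relation is a strict comparison between two (x, y) grid positions.
RELATIONS = {
    "north_of": lambda a, b: a[1] > b[1],
    "south_of": lambda a, b: a[1] < b[1],
    "east_of": lambda a, b: a[0] > b[0],
    "west_of": lambda a, b: a[0] < b[0],
}


def is_consistent(objects, constraints):
    """Return True if some placement of the objects satisfies every constraint.

    Brute force over a grid whose side equals the number of objects, which is
    enough to realise any satisfiable set of strict ordering constraints.
    Objects are allowed to share a cell in this simplified model.
    """
    size = len(objects)
    cells = list(product(range(size), repeat=2))
    for placement in product(cells, repeat=len(objects)):
        pos = dict(zip(objects, placement))
        if all(RELATIONS[rel](pos[a], pos[b]) for a, rel, b in constraints):
            return True
    return False


def plausible_answers(objects, facts, a, b):
    """List every relation between a and b that does not contradict the facts."""
    return sorted(
        rel for rel in RELATIONS
        if is_consistent(objects, facts + [(a, rel, b)])
    )


# A two-hop story: the relation between sofa and lamp is never stated directly.
objects = ["sofa", "table", "lamp"]
facts = [("sofa", "north_of", "table"), ("table", "north_of", "lamp")]

answers = plausible_answers(objects, facts, "sofa", "lamp")
print(answers)  # ['east_of', 'north_of', 'west_of']
```

In this toy story only `south_of` is ruled out. The east-west relation is left open by the facts, so a model answering `east_of` would not be contradicting the description. This is the sense in which a consistency check can accept several answers instead of a single gold label.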
