Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark

Fangjun Li; David C. Hogg; Anthony G. Cohn

TL;DR

The paper revisits the StepGame spatial-reasoning benchmark and corrects template errors that distorted earlier evaluations of LLMs. It analyzes GPT-family models on the corrected dataset, finding that large models map natural language to spatial relations effectively but struggle with multi-hop reasoning. Combining sentence-to-template mapping with an ASP reasoner yields a logic-based solution that solves the benchmark without errors. Building on this, the authors develop customized Chain-of-Thought and Tree-of-Thoughts prompting strategies, which substantially improve the spatial-reasoning accuracy of large models and point toward combining symbolic and neural reasoning for spatial tasks.

Abstract

Artificial intelligence (AI) has made remarkable progress across various domains, with large language models like ChatGPT gaining substantial attention for their human-like text-generation capabilities. Despite these achievements, spatial reasoning remains a significant challenge for these models. Benchmarks like StepGame evaluate AI spatial reasoning, and ChatGPT has shown unsatisfactory performance on them. However, template errors in the benchmark distort the evaluation results, so ChatGPT may perform better once these errors are corrected, yielding a more accurate assessment of its spatial reasoning capabilities. In this study, we refine the StepGame benchmark, providing a more accurate dataset for model evaluation. We analyze GPT's spatial reasoning performance on the rectified benchmark, identifying proficiency in mapping natural language text to spatial relations but limitations in multi-hop reasoning. We provide a flawless solution to the benchmark by combining template-to-relation mapping with logic-based reasoning. This combination performs qualitative reasoning on StepGame without encountering any errors. We then address the limitations of GPT models in spatial reasoning. We deploy Chain-of-Thought and Tree-of-Thoughts prompting strategies, offering insights into GPT's "cognitive process" and achieving remarkable improvements in accuracy. Our investigation not only sheds light on model deficiencies but also proposes enhancements, contributing to the advancement of AI with more robust spatial reasoning capabilities.
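The combination of template-to-relation mapping and logic-based reasoning described in the abstract can be illustrated with a minimal sketch. The paper uses an ASP reasoner; the Python below is a stand-in, not the authors' implementation. The sentence templates, the unit-offset encoding of each relation, the sign-of-summed-offsets rule for multi-hop answers, and the names `OFFSETS`, `LABELS`, `parse`, and `relation` are all assumptions made for illustration.

```python
import re
from collections import deque

# Hypothetical templates: each relation phrase maps to a unit offset (dx, dy).
# "A is to the left of B." is read as position(A) - position(B) = (-1, 0).
OFFSETS = {
    "to the left of": (-1, 0),
    "to the right of": (1, 0),
    "above": (0, 1),
    "below": (0, -1),
    "to the upper left of": (-1, 1),
    "to the upper right of": (1, 1),
    "to the lower left of": (-1, -1),
    "to the lower right of": (1, -1),
}

# Sign pattern of the accumulated offset mapped back to a relation label.
LABELS = {
    (-1, 0): "left", (1, 0): "right", (0, 1): "above", (0, -1): "below",
    (-1, 1): "upper-left", (1, 1): "upper-right",
    (-1, -1): "lower-left", (1, -1): "lower-right", (0, 0): "overlap",
}

PATTERN = re.compile(
    r"^(\w+) is (" + "|".join(re.escape(p) for p in OFFSETS) + r") (\w+)\.$"
)


def parse(sentence):
    """Map one story sentence to (entity_a, entity_b, offset of a relative to b)."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        raise ValueError(f"no template matches: {sentence!r}")
    return match.group(1), match.group(3), OFFSETS[match.group(2)]


def sign(value):
    return (value > 0) - (value < 0)


def relation(story, source, target):
    """Return the relation of `source` to `target` by chaining story facts."""
    edges = {}
    for sentence in story:
        a, b, (dx, dy) = parse(sentence)
        edges.setdefault(a, []).append((b, dx, dy))    # a relative to b
        edges.setdefault(b, []).append((a, -dx, -dy))  # b relative to a
    # Breadth-first search; rel[n] holds position(source) - position(n).
    rel = {source: (0, 0)}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        x, y = rel[node]
        for neighbour, dx, dy in edges.get(node, []):
            if neighbour not in rel:
                rel[neighbour] = (x + dx, y + dy)
                queue.append(neighbour)
    if target not in rel:
        raise ValueError(f"{target!r} is not connected to {source!r}")
    x, y = rel[target]
    return LABELS[(sign(x), sign(y))]


story = ["A is to the left of B.", "C is below B."]
print(relation(story, "A", "C"))  # prints "upper-left"
```

Here the two-hop question about A and C is answered by chaining A to B and B to C, which is the kind of multi-hop step the abstract reports GPT models struggle with when no explicit reasoning procedure is supplied.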

Paper Structure (25 sections, 3 figures, 4 tables)

Introduction
Related Work
The StepGame Benchmark for Evaluating Spatial Reasoning
Spatial Reasoning Types
Problems with the Dataset
Methods
Solution for the Corrected Benchmark
Chain-of-Thought (CoT) Prompting
Tree-of-Thoughts (ToT) Prompting
Experimental Design
Model Settings
Different Test Subsets
Different Few-Shot Sets
Experimental Results
Evaluation Results
...and 10 more sections

Figures (3)

Figure 1: An illustrative example for demonstrating relation extraction and 1-hop spatial reasoning.
Figure 2: Example of 10-hop reasoning, featuring a question regarding two entities that are not directly connected in the stories. The diagrams on the right do not form part of the input to the AI system but are for illustrative purposes only.
Figure 4: Accuracy comparison for varying numbers of hops (1-10) on the clean test set. Left: performance of the Turbo model with 10-shot prompting over different test set sizes (30, 100, and 1000 examples). Middle: performance of the Turbo model under three prompting settings: 5-shot (1,3,5,7,10), 10-shot, and 5-shot separate. Right: performance of two models, Davinci and Turbo, using 10-shot prompting.
