Can Large Language Models Reason about the Region Connection Calculus?

Anthony G Cohn, Robert E Blackwell

TL;DR

This study evaluates whether large language models can perform RCC-8 qualitative spatial reasoning, using three experiment pairs that target relational composition tables, human-preference alignment, and conceptual neighbourhood (CN) reasoning. Across a diverse set of models and both eponymous and anonymised relation names, the results show that LLMs struggle to reliably reconstruct the composition table ($CT$) and to fully align with human cognitive preferences, though some models approach human-like cues on specific tasks and the CN task is relatively easier. The work highlights substantial stochasticity and biases in LLM reasoning for spatial tasks, and it emphasises the need for dedicated QSR benchmarks, prompts, or multimodal approaches to achieve robust symbolic spatial reasoning. Overall, the findings suggest LLMs are not yet reliable symbolic reasoners for RCC-8, but the work offers a framework and dataset for rigorous future evaluation and a clear direction for follow-up studies on more tractable calculi and on prompting strategies.

Abstract

Qualitative Spatial Reasoning is a well-explored area of Knowledge Representation and Reasoning and has multiple applications ranging from Geographical Information Systems to Robotics and Computer Vision. Recently, many claims have been made for the reasoning capabilities of Large Language Models (LLMs). Here, we investigate the extent to which a set of representative LLMs can perform classical qualitative spatial reasoning tasks on the mereotopological Region Connection Calculus, RCC-8. We conduct three pairs of experiments (reconstruction of composition tables, alignment to human composition preferences, conceptual neighbourhood reconstruction) using state-of-the-art LLMs; in each pair one experiment uses eponymous relations and one, anonymous relations (to test the extent to which the LLM relies on knowledge about the relation names obtained during training). All instances are repeated 30 times to measure the stochasticity of the LLMs.

Paper Structure

This paper contains 20 sections, 9 figures, 8 tables.

Figures (9)

  • Figure 1: The eight relations of the RCC-8 calculus illustrated in 2D [cohn1997qualitative]: DC (Disconnected), EC (Externally Connected), PO (Partially Overlapping), TPP (Tangential Proper Part), NTPP (Nontangential Proper Part) and EQ (Equals); TPPi and NTPPi are the inverses of TPP and NTPP respectively, since these two relations are asymmetric.
  • Figure 2: RCC-8 CT shaded by mean Jaccard Index (n=30 repeats) for the best performing model, Claude-3.5S, and the worst performing model, GPT-3.5T. The entry in each cell uses the following coding: D (DC), E (EC), P (PO), T (TPP), N (NTPP), t (TPPi), n (NTPPi), Q (EQ). The full results are in the appendix, Table \ref{rcc8-composition-table}.
  • Figure 3: Relation statistics for the CT for RCC-8. The left-hand chart shows the absolute number of relations, and the right-hand chart the relative percentage for each relation across all thirty repeats. "All" is the aggregate of all relations.
  • Figure 4: Same as for Figure \ref{ct-jaccard} but for anonymous relations.
  • Figure 5: RCC-8 preferred relations shaded by mean Jaccard Index (n=30 repeats) for the best performing model, Claude-3.5S, and the worst performing model, Llama-3 70B. Labels show preferred relations as reported by Ragni et al. [ragni2007cross].
  • ...and 4 more figures
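
Several of the captions above report a mean Jaccard Index over repeated runs. As a rough illustration of how such a score could be computed for a single composition-table cell, here is a minimal Python sketch. The function names (`jaccard`, `mean_jaccard`) and the sample responses are hypothetical and are not taken from the paper; treating two empty sets as a perfect match is also an assumption, and the authors' exact scoring procedure may differ. The one fact relied on is the standard RCC-8 composition of TPP with TPP, which is {TPP, NTPP}.

```python
from statistics import mean


def jaccard(predicted, expected):
    """Jaccard index |A & B| / |A | B| between two sets of relation names.

    Assumption: two empty sets count as identical (score 1.0).
    """
    predicted, expected = set(predicted), set(expected)
    union = predicted | expected
    if not union:
        return 1.0
    return len(predicted & expected) / len(union)


def mean_jaccard(responses, expected):
    """Average the Jaccard index over repeated answers for one table cell."""
    return mean(jaccard(response, expected) for response in responses)


# Ground truth for one CT cell: TPP composed with TPP gives {TPP, NTPP}.
expected = {"TPP", "NTPP"}

# Made-up answers standing in for repeated model runs (illustrative only,
# not results from the paper; the study uses 30 repeats per instance).
responses = [
    {"TPP", "NTPP"},        # exact match
    {"TPP"},                # one relation missing
    {"TPP", "NTPP", "PO"},  # one spurious relation
    {"DC"},                 # no overlap
]

score = mean_jaccard(responses, expected)
print(f"mean Jaccard index: {score:.3f}")
```

A score of 1.0 for a cell would mean every repeat returned exactly the expected set of relations; both missing and spurious relations lower the score, which is why a set-overlap measure suits multi-relation table entries.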