Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited

Anthony G Cohn, Robert E Blackwell

TL;DR

The paper evaluates 28 Large Language Models (LLMs) on reasoning about cardinal directions (CDs), using a large, template-generated benchmark of 5760 questions that tests spatial reasoning in non-embodied systems. It extends earlier COSIT-24 work by analysing Large Reasoning Models (LRMs), which report inference-time reasoning, and by running extensive zero-shot experiments with detailed token-level analyses. No model fully solves CD reasoning: the best LRMs reach high but incomplete accuracy (around 0.92 for o1), while other models vary substantially across templates, means of locomotion, and cardinal versus intercardinal directions, with intercardinal cases requiring more reasoning tokens. The work highlights the limits of current CD reasoning in LLMs and the importance of precisely specified experimental conditions, and it points to future work on prompting, multilingual benchmarking, and scalable, reproducible evaluation of spatial reasoning in AI systems.

Abstract

We investigate the abilities of 28 Large Language Models (LLMs) to reason about cardinal directions (CDs) using a benchmark generated from a set of templates, extensively testing an LLM's ability to determine the correct CD given a particular scenario. The templates allow for a number of degrees of variation, such as the means of locomotion of the agent involved and whether the scenario is set in the first, second or third person. Even the newer Large Reasoning Models are unable to reliably determine the correct CD for all questions. This paper summarises and extends earlier work presented at COSIT-24.
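
To make the idea of a template-generated benchmark concrete, the following is a minimal Python sketch of how one template might be expanded over direction, person form, and means of locomotion. The template wording, the variation lists, and every name in the code are hypothetical and chosen for illustration only; they are not the paper's actual templates and do not reproduce its 5760 questions.

```python
from itertools import product

# The eight cardinal and intercardinal directions, in clockwise order.
DIRECTIONS = ["north", "north-east", "east", "south-east",
              "south", "south-west", "west", "north-west"]

# The opposite direction is four steps around the compass.
OPPOSITE = {d: DIRECTIONS[(i + 4) % 8] for i, d in enumerate(DIRECTIONS)}

# Hypothetical variation lists (assumptions, not taken from the paper).
PERSONS = {"first": "I am", "second": "You are", "third": "She is"}
LOCOMOTION = ["walking", "cycling", "driving"]

# Hypothetical template: if the agent is on the <side> shore of a lake,
# the lake lies in the opposite direction.
TEMPLATE = ("{subject} {locomotion} along the {side} shore of a lake. "
            "In which direction is the lake?")


def generate_questions():
    """Expand the template over every combination of the variations."""
    questions = []
    for side, (person, subject), locomotion in product(
            DIRECTIONS, PERSONS.items(), LOCOMOTION):
        questions.append({
            "question": TEMPLATE.format(
                subject=subject, locomotion=locomotion, side=side),
            "answer": OPPOSITE[side],
            "person": person,
            "locomotion": locomotion,
        })
    return questions


if __name__ == "__main__":
    qs = generate_questions()
    print(len(qs))            # 8 directions x 3 person forms x 3 locomotion types
    print(qs[0]["question"], "->", qs[0]["answer"])
```

Because each question records its ground-truth answer alongside the variation values that produced it, results can later be grouped by direction, person form, or locomotion without re-parsing the question text.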

Paper Structure

This paper contains 6 sections, 5 figures.

Figures (5)

  • Figure 1: Accuracy by LLM. Results shaded in blue are from the earlier COSIT-24 paper [cohn_et_al:LIPIcs.COSIT.2024.28]. Results with a black border are LRMs. The red dotted line is the guess rate (0.125, since there are eight possible answers). Where possible, we use the model names and versions reported in the LLM response, prefixed by the API provider (e.g. openai or azure). Models run locally are prefixed with ollama.
  • Figure 2: Confusion matrix for the best performing model, o1.
  • Figure 3: Median and inter-quartile range of o1 reasoning tokens by ground truth direction.
  • Figure 4: Reasoning token counts for correct and incorrect answers by LRM. The white bar shows the median. Note that all LRMs tested have more correct than incorrect answers and so incorrect sample sizes are small.
  • Figure 5: Accuracy by direction, locomotion, person form, and question template for selected models. The grey concentric circles in the background are spaced 0.1 accuracy apart.
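
The guess rate in Figure 1 and the per-direction breakdown in Figure 5 rest on simple arithmetic, which can be sketched in Python as follows. This is an illustrative sketch only: the record format (pairs of ground-truth and predicted direction) and the function name are assumptions, not the authors' evaluation code.

```python
from collections import defaultdict

DIRECTIONS = ["north", "north-east", "east", "south-east",
              "south", "south-west", "west", "north-west"]

# With eight equally likely answers, uniform guessing is right 1 time in 8.
GUESS_RATE = 1 / len(DIRECTIONS)


def accuracy_by_direction(records):
    """Return accuracy per ground-truth direction.

    `records` is an iterable of (ground_truth, predicted) pairs;
    this format is a hypothetical choice for the sketch.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for truth, predicted in records:
        totals[truth] += 1
        if predicted == truth:
            correct[truth] += 1
    return {d: correct[d] / totals[d] for d in totals}


if __name__ == "__main__":
    sample = [("north", "north"), ("north", "south"), ("east", "east")]
    print(GUESS_RATE)
    print(accuracy_by_direction(sample))
```

Comparing each per-direction accuracy against `GUESS_RATE` shows whether a model does better than chance on that direction.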