Evaluating the Ability of Large Language Models to Reason about Cardinal Directions

Anthony G Cohn, Robert E Blackwell

TL;DR

This paper investigates the ability of large language models to reason about cardinal directions (CDs). It introduces two datasets: a small recall-focused CD set co-created with ChatGPT and a large template-driven CD reasoning dataset that varies locomotion, perspective, and CD pairs to test compositional reasoning. Empirically, LLMs perform well on the small recall task but fail to reliably solve the large template-based CD questions, with the best reported accuracy around 0.595 and notable rubric-adherence issues. The findings highlight a gap between memorized world knowledge and structured spatial reasoning in current LLMs and point to the need for improved prompting strategies and richer benchmarks for geographic/spatial reasoning.

Abstract

We investigate the abilities of a representative set of Large Language Models (LLMs) to reason about cardinal directions (CDs). To do so, we create two datasets: the first, co-created with ChatGPT, focuses largely on recall of world knowledge about CDs; the second is generated from a set of templates, comprehensively testing an LLM's ability to determine the correct CD given a particular scenario. The templates allow for a number of degrees of variation, such as the means of locomotion of the agent involved and whether the scenario is set in the first, second or third person. Our experiments show that although LLMs are able to perform well on the simpler dataset, on the second, more complex dataset no LLM is able to reliably determine the correct CD, even with a temperature setting of zero.
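To make the template idea concrete, the following is a minimal sketch of how one template could be expanded over heading, locomotion and person form. The template wording, the slot values and the helper names (`right_of`, `generate_questions`) are all assumptions made for illustration; they are not the templates or code used in the paper.

```python
from itertools import product

# Cardinal directions in clockwise order, so "turn right" is the next entry.
CARDINALS = ["north", "east", "south", "west"]

# Hypothetical slot values; the paper's actual variations are not reproduced here.
LOCOMOTION = ["walking", "cycling", "driving"]
PERSONS = {
    "first": {"subj_be": "I am", "poss": "my"},
    "second": {"subj_be": "You are", "poss": "your"},
    "third": {"subj_be": "She is", "poss": "her"},
}

# Hypothetical template, invented for this sketch.
TEMPLATE = "{subj_be} {locomotion} {heading}. Which direction is on {poss} right?"


def right_of(heading: str) -> str:
    """Return the cardinal direction 90 degrees clockwise from heading."""
    return CARDINALS[(CARDINALS.index(heading) + 1) % len(CARDINALS)]


def generate_questions() -> list[dict]:
    """Expand the template over every heading/locomotion/person combination."""
    questions = []
    for heading, locomotion, (person, forms) in product(
        CARDINALS, LOCOMOTION, PERSONS.items()
    ):
        questions.append(
            {
                "question": TEMPLATE.format(
                    heading=heading, locomotion=locomotion, **forms
                ),
                "answer": right_of(heading),
                "heading": heading,
                "locomotion": locomotion,
                "person": person,
            }
        )
    return questions


if __name__ == "__main__":
    for item in generate_questions()[:3]:
        print(item["question"], "->", item["answer"])
```

Keeping the slot values as metadata on each question makes it straightforward to break accuracy down by template, direction, locomotion or person form afterwards.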

Paper Structure (6 sections, 4 figures, 1 table)

This paper contains 6 sections, 4 figures, and 1 table.

Figures (4)

  • Figure 1: (a) and (c) show accuracy by LLM and the confusion matrix, respectively, for the small question set; (b) and (d) show the same for the large question set. Answers that cannot be interpreted as a CD or an inter-CD are considered invalid. To avoid bias from including three gpt-35-turbo models, the confusion matrices exclude gpt-35-turbo-0613 and gpt-35-turbo-1106 but include gpt-35-turbo-0125.
  • Figure 2: Accuracy by (a) question template, (b) direction, (c) locomotion and (d) person form for the large question set. To avoid bias from including three gpt-35-turbo models, we exclude gpt-35-turbo-0613 and gpt-35-turbo-1106 but include gpt-35-turbo-0125.
  • Figure 3: Confusion matrices for each of the LLMs tested on the large question set. Answers that cannot be interpreted as a CD or an inter-CD are considered invalid.
  • Figure 4: Accuracy by temperature for gpt-35-turbo-0125 applied to the large question set.
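
The captions above describe scoring in which each answer is mapped to a CD, an inter-CD, or "invalid" before accuracy and confusion matrices are computed. A minimal sketch of that kind of scoring follows. The normalization rule (exact match after stripping case, whitespace, hyphens and a trailing full stop) and the function names are assumptions for illustration; the paper's actual interpretation rules may be more lenient.

```python
import re
from collections import Counter

# The four CDs plus the four inter-CDs, in clockwise order.
DIRECTIONS = [
    "north", "north-east", "east", "south-east",
    "south", "south-west", "west", "north-west",
]

# Lookup from a separator-free spelling (e.g. "northeast") to the canonical label.
_LOOKUP = {d.replace("-", ""): d for d in DIRECTIONS}


def normalize_answer(raw: str) -> str:
    """Map a raw model answer to a canonical direction, or "invalid".

    Assumed rule: the whole answer must be a direction name once case,
    whitespace, hyphens, underscores and a trailing full stop are removed.
    """
    text = re.sub(r"[\s\-_]+", "", raw.strip().lower()).rstrip(".")
    return _LOOKUP.get(text, "invalid")


def confusion_matrix(pairs: list[tuple[str, str]]) -> Counter:
    """Count (expected, predicted) pairs; predicted may be "invalid"."""
    return Counter((expected, normalize_answer(raw)) for expected, raw in pairs)


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of answers whose normalized form equals the expected label."""
    correct = sum(normalize_answer(raw) == expected for expected, raw in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    # Made-up answers, only to exercise the functions.
    sample = [("north", "North"), ("east", "west"), ("south", "I am not sure")]
    print(accuracy(sample))
    print(confusion_matrix(sample))
```

Treating uninterpretable answers as their own "invalid" column, rather than dropping them, keeps them visible in the confusion matrix while still counting them as incorrect.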