FoREST: Frame of Reference Evaluation in Spatial Reasoning Tasks

Tanawan Premsri, Parisa Kordjamshidi

TL;DR

FoREST tackles the under-explored problem of Frame of Reference (FoR) comprehension in spatial reasoning by introducing a multimodal benchmark that jointly assesses textual FoR understanding and FoR grounding for text-to-image generation. It combines question-answering (QA) and diffusion-based layout tasks, with ambiguous (A-split) and unambiguous (C-split) contexts, to reveal model biases and cross-perspective reasoning challenges. A key contribution is Spatial-Guided (SG) prompting, which elicits spatial primitives (direction, topology, distance) and FoR information to enhance reasoning and layout quality, improving both QA accuracy and image-layout accuracy. The results show persistent FoR biases in language-only models, gains from multimodal training, and substantial improvements from SG prompting, underscoring its practical value for embodied AI and downstream spatial reasoning tasks.

Abstract

Spatial reasoning is a fundamental aspect of human intelligence. One key concept in spatial cognition is the Frame of Reference (FoR), which identifies the perspective of spatial expressions. Despite its significance, FoR has received limited attention in AI models that need spatial intelligence. There is a lack of dedicated benchmarks and in-depth evaluation of large language models (LLMs) in this area. To address this issue, we introduce the Frame of Reference Evaluation in Spatial Reasoning Tasks (FoREST) benchmark, designed to assess FoR comprehension in LLMs. Using FoREST, we evaluate LLMs on answering questions that require FoR comprehension and on generating layouts for text-to-image models. Our results reveal a notable performance gap across different FoR classes in various LLMs, affecting their ability to generate accurate layouts for text-to-image generation. This highlights critical shortcomings in FoR comprehension. To improve FoR understanding, we propose Spatial-Guided prompting, which improves LLMs' ability to extract essential spatial concepts. Our proposed method improves overall performance across spatial reasoning tasks.

Paper Structure

This paper contains 49 sections, 13 figures, and 20 tables.

Figures (13)

  • Figure 1: Illustration of FoR classes. The cat is the locatum, the car is the relatum, and arrows denote the perspective.
  • Figure 2: The dataset creation pipeline. It begins by selecting a locatum and a relatum from a pre-defined list of objects and then applies templates to generate the spatial expressions ($T$). FoRs are then assigned based on the relatum properties. $T$ is categorized based on the number of applicable FoRs. For example, "A cat is to the right of a dog" (with two possible FoRs: external intrinsic and external relative) belongs to the A-split. Then, its disambiguated version ("A cat is to the right of a dog from the dog's perspective") is added to the C-split. Next, if applicable, the relatum orientation is included for visualization and question generation. Finally, Unity3D generates the scene configurations, and the question-answer pairs are derived from $T$.
  • Figure 3: Confusion matrices of spatial relation predictions by Llama3 and GPT-4o in 0-shot and SG+CoT settings, when FoR adaptation is required.
  • Figure 4: All 3D models used to generate visualizations for FoREST.
  • Figure 5: Confusion matrices of spatial relation answers when Qwen2 and Qwen2-VL must adapt FoR in the 0-shot and SG+CoT settings.
  • ...and 8 more figures
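
The split logic described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the `SceneObject` class, the `has_intrinsic_direction` property, the rule that an external relative FoR always applies, the "observer" wording for the relative case, and the choice to place single-FoR expressions directly in the C-split are all assumptions made for illustration. Only the two example sentences and the A-split/C-split distinction come from the caption.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class SceneObject:
    name: str
    # Hypothetical relatum property: the object has its own front/back.
    has_intrinsic_direction: bool


def article(name):
    """Pick an indefinite article for an object name."""
    return "an" if name[0] in "aeiou" else "a"


def applicable_fors(relatum):
    """Assumed rule: external relative always applies; external intrinsic
    applies only when the relatum has an intrinsic direction."""
    frames = ["external relative"]
    if relatum.has_intrinsic_direction:
        frames.insert(0, "external intrinsic")
    return frames


def disambiguate(expression, relatum, frame):
    """Append an explicit perspective so that exactly one FoR applies."""
    if frame == "external intrinsic":
        owner = f"the {relatum.name}'s"
    else:
        owner = "the observer's"  # assumed wording for the relative case
    return f"{expression[:-1]} from {owner} perspective."


def build_splits(objects, relations):
    """Template expressions, assign FoRs, and sort them into A/C splits."""
    a_split, c_split = [], []
    for locatum, relatum in permutations(objects, 2):
        for relation in relations:
            text = (
                f"{article(locatum.name).capitalize()} {locatum.name} "
                f"is {relation} {article(relatum.name)} {relatum.name}."
            )
            frames = applicable_fors(relatum)
            if len(frames) > 1:
                # Ambiguous: keep as-is in A, add one clear version per FoR to C.
                a_split.append((text, frames))
                for frame in frames:
                    c_split.append((disambiguate(text, relatum, frame), [frame]))
            else:
                # Assumption: single-FoR expressions are already unambiguous.
                c_split.append((text, frames))
    return a_split, c_split
```

Calling `build_splits` with a cat, a dog, and a direction-less object such as a box, plus the relation "to the right of", places "A cat is to the right of a dog." in the A-split and its perspective-qualified variants in the C-split, mirroring the example in the caption.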