Exploring Spatial Language Grounding Through Referring Expressions

Akshar Tumu, Parisa Kordjamshidi

TL;DR

This work probes spatial grounding in vision-language models by reframing spatial reasoning as a Referring Expression Comprehension task on the CopsRef dataset, which encodes 51 spatial relations across eight categories. It compares a task-specific REC model, MGA-Net, with two VLMs (Grounding DINO and LLaVA) and an OWL-ViT baseline, using three evaluation splits to examine the impact of spatial composition, visual complexity, and negation on grounding accuracy. The results show that task-specific compositional reasoning (MGA-Net) excels on geometric relations and multi-clause expressions, while VLMs struggle with directional and negated relations, and performance generally degrades as spatial complexity increases. These findings highlight the need for compositional, neuro-symbolic or negation-aware training to enhance spatial grounding in multimodal models, guiding future research toward more robust spatial reasoning capabilities in VLMs.

Abstract

Spatial reasoning is an important component of human cognition and an area in which the latest vision-language models (VLMs) show signs of difficulty. Existing analyses rely on image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for evaluating spatial reasoning in VLMs. This platform provides the opportunity for a deeper analysis of spatial comprehension and grounding abilities in the presence of 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all of these models face challenges with the task at hand, their relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.

Paper Structure

This paper contains 36 sections, 1 equation, 1 figure, and 12 tables.

Figures (1)

  • Figure 1: Examples for qualitative analysis. In each example, the green box is the ground-truth bounding box. The red, blue, and yellow boxes are the output bounding boxes of MGA-Net, Grounding DINO, and LLaVA, respectively.
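
A comparison between a predicted box and the ground-truth box, as in Figure 1, is commonly scored in referring expression comprehension with intersection over union (IoU). The sketch below illustrates that convention only. This excerpt does not state the metric or threshold the authors use, so the 0.5 threshold, the `(x_min, y_min, x_max, y_max)` box format, and the function names `iou` and `grounding_accuracy` are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    predictions: Sequence[Box],
    ground_truths: Sequence[Box],
    threshold: float = 0.5,  # assumed threshold, not taken from the paper
) -> float:
    """Fraction of expressions whose predicted box reaches the IoU threshold."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have equal length")
    if not predictions:
        return 0.0
    hits = sum(
        iou(pred, gt) >= threshold
        for pred, gt in zip(predictions, ground_truths)
    )
    return hits / len(predictions)
```

Under this convention, a model that places a box on the wrong instance of the right object class typically scores zero for that expression, which is why ambiguity in object detection matters for grounding accuracy.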