Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models

Akshar Tumu, Varad Shinde, Parisa Kordjamshidi

TL;DR

The paper probes spatial reasoning in Vision-Language Models by reframing evaluation around Referring Expression Comprehension (REC), enabling simultaneous grounding and relational reasoning under ambiguity, longer expressions, and negation. It compares a task-specific MGA-Net against large VLMs (LLaVA, Grounding DINO, DeepSeek-VL2, Qwen) and OWL-ViT across carefully designed splits that vary spatial complexity and visual difficulty. Key findings show that geometric, unambiguous relations are easier for grounding, while directional and negated expressions pose substantial challenges; MGA-Net's compositional, graph-based approach maintains performance with increasing relational complexity, unlike many VLMs. The work highlights gaps for future directions, including incorporating metric relations, neuro-symbolic processing, and targeted negation-training to enhance robust spatial grounding in multimodal systems.

Abstract

Spatial reasoning is an important component of human cognition and an area in which the latest vision-language models (VLMs) show signs of difficulty. Existing analyses rely on image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for evaluating spatial reasoning in VLMs. This platform allows a deeper analysis of spatial comprehension and grounding abilities in the presence of 1) ambiguity in object detection, 2) complex spatial expressions with longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in these specific situations. While all of these models face challenges with the task, their relative behaviors depend on the underlying model and on the specific category of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.
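
To make the evaluation setting concrete, here is a minimal sketch of how a Referring Expression Comprehension prediction is commonly scored, broken down by spatial category. The abstract does not state the paper's metric, so the IoU threshold of 0.5, the `(x1, y1, x2, y2)` box format, and the names `iou` and `accuracy_by_category` are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict

# Assumed box format: (x1, y1, x2, y2) in pixels, with x2 >= x1 and y2 >= y1.

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_by_category(samples, threshold=0.5):
    """Fraction of predictions with IoU >= threshold, per category.

    `samples` is an iterable of (category, predicted_box, ground_truth_box).
    The threshold of 0.5 is a common convention, assumed here.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for category, pred, gt in samples:
        totals[category] += 1
        if iou(pred, gt) >= threshold:
            hits[category] += 1
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical toy data, not results from the paper.
samples = [
    ("topological", (0, 0, 10, 10), (0, 0, 10, 10)),    # exact match
    ("directional", (0, 0, 10, 10), (20, 20, 30, 30)),  # disjoint boxes
    ("directional", (0, 0, 10, 10), (1, 0, 11, 10)),    # large overlap
]
scores = accuracy_by_category(samples)
```

Grouping by category rather than reporting one aggregate number is what lets an analysis of this kind separate, say, topological from directional expressions.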

Paper Structure

This paper contains 38 sections, 1 equation, 1 figure, 11 tables.

Figures (1)

  • Figure 1: Figures for qualitative analysis. Bounding box legend - Red: MGA-Net, Blue: GDINO, Yellow: LLaVA, Orange: DeepSeek-VL2, Pink: Qwen2.5-VL, Green: Ground-truth