SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation

Wenyu Zhang, Wei En Ng, Lixin Ma, Yuwen Wang, Junqi Zhao, Allison Koenecke, Boyang Li, Lu Wang

TL;DR

This work introduces SPHERE, a hierarchical evaluation framework with a manually annotated, MS COCO-based dataset for systematically probing spatial perception and reasoning in vision-language models. It organizes tasks into three levels: single-skill, multi-skill, and high-level reasoning (including occlusion and object manipulation). Evaluation reveals substantial deficiencies in current state-of-the-art models, particularly in distance, size constancy, and perspective (allocentric vs. egocentric) reasoning. Even spatially aware models struggle with integrated spatial reasoning tasks, which highlights the need for methods that align model spatial cognition with human spatial understanding. By providing a structured benchmark and detailed analyses, SPHERE aims to drive progress toward more robust, human-like spatial reasoning in vision-language models. The benchmark is available at https://github.com/zwenyu/SPHERE-VLM.

Abstract

Current vision-language models may grasp basic spatial cues and simple directions (e.g. left, right, front, back), but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. To address this gap, we develop SPHERE (Spatial Perception and Hierarchical Evaluation of REasoning), a hierarchical evaluation framework supported by a new human-annotated dataset. SPHERE systematically probes models across increasing levels of complexity, from fundamental skills to multi-skill integration and high-level reasoning that combines spatial, visual, and logical understanding. Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity, understanding both egocentric and allocentric perspectives, and applying spatial logic in physical contexts. These findings expose critical blind spots in existing models and underscore the need for more advanced spatial reasoning techniques, driving the development of vision-language models that align more closely with human spatial cognition. The SPHERE benchmark is available at https://github.com/zwenyu/SPHERE-VLM.

Paper Structure

This paper contains 22 sections, 10 figures, 10 tables.

Figures (10)

  • Figure 1: State-of-the-art models such as GPT-4o still have difficulty with questions that require multiple spatial, visual, and reasoning skills. GPT-4o itself bolded the words in the second and third examples.
  • Figure 2: The proposed SPHERE framework evaluates vision-language models on a hierarchy of tasks, advancing from single-skill tasks to multi-skill tasks, and ultimately to complex reasoning tasks that require the integration of multiple spatial and visual cues with logical reasoning abilities.
  • Figure 3: Examples of open-ended questions for counting-related tasks and multiple-choice questions for other tasks in the annotated SPHERE dataset. Ground-truth answers are in green.
  • Figure 4: Examples of position-related questions asked from allocentric and egocentric viewpoints.
  • Figure 5: Distribution of ground-truth answers for the counting task.
  • ...and 5 more figures