How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, Zhedong Zheng, Zhipeng Zhang, Yifan Wang, Lin Song, Lijun Wang, Yanwei Li, Ying Shan, Huchuan Lu

TL;DR

This paper presents a systematic study of Visual Spatial Reasoning (VSR) in Vision-Language Models (VLMs) and introduces SIBench, a benchmark covering 23 task settings organized into three cognitive levels. It shows a clear gap between perceptual abilities and higher-order spatial reasoning: models handle attributes and basic relations well but falter in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. The authors review existing methods along four axes (input modalities, architectures, training, and inference) and use a cognition-based taxonomy to guide future work, including 3D-aware pretraining, higher-quality diverse data, and unified spatiotemporal architectures. The work also discusses practical applications in embodied AI and autonomous driving, underscoring the need for robust spatial understanding to enable real-world action and decision-making.

Abstract

Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, encompassing a review of existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning: models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. The related resources of this study are accessible at https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/.
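
To make the "gap between perception and reasoning" concrete, the following is a minimal sketch of how per-task accuracies could be averaged within each of the three capability levels and then compared. It is not the SIBench evaluation code: the task names, the task-to-level assignment in `TASK_LEVEL`, the unweighted averaging, and the function names are all illustrative assumptions, and any accuracies passed in would be placeholders rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical task-to-level assignment, for illustration only.
# The real benchmark defines 23 task settings; these four names are assumptions.
TASK_LEVEL = {
    "object_attribute": "basic_perception",
    "spatial_relation": "spatial_understanding",
    "multi_view_reasoning": "spatial_understanding",
    "route_planning": "planning",
}


def level_scores(task_accuracy):
    """Average per-task accuracy within each capability level (unweighted)."""
    by_level = defaultdict(list)
    for task, accuracy in task_accuracy.items():
        by_level[TASK_LEVEL[task]].append(accuracy)
    return {level: mean(values) for level, values in by_level.items()}


def perception_reasoning_gap(scores):
    """Basic-perception score minus the mean score of the higher levels."""
    higher = [score for level, score in scores.items()
              if level != "basic_perception"]
    return scores["basic_perception"] - mean(higher)


if __name__ == "__main__":
    # Placeholder accuracies, NOT numbers reported in the paper.
    placeholder = {
        "object_attribute": 0.8,
        "spatial_relation": 0.6,
        "multi_view_reasoning": 0.4,
        "route_planning": 0.3,
    }
    scores = level_scores(placeholder)
    print(scores)
    print(perception_reasoning_gap(scores))
```

A positive gap under this definition would mean a model scores higher on basic perception than on the understanding and planning levels on average; other aggregation choices (for example, weighting tasks by sample count) are equally plausible.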

Paper Structure

This paper contains 40 sections, 4 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Performance of SOTA Models on 23 Visual Spatial Reasoning Tasks (Left). The evaluation reveals that the models have significant room for improvement, especially in tasks requiring precise numerical estimation, perspective taking, temporal information processing, and, particularly, spatial imagination. See Table \ref{tab:model_performance} and Table \ref{tab:model_performance-mini} for detailed results. Comparison of Visual Spatial Reasoning and General VQA (Upper-right). While general VQA tasks primarily focus on extracting semantic information from images, VSR necessitates a deeper capacity to model and reason about spatial relationships. Data Formats and Task Settings for Visual Spatial Reasoning (Bottom-right). The evaluation includes 3 input formats and 23 task settings, covering three levels: Basic Perception, Spatial Understanding, and Planning.
  • Figure 2: An overview of the primary methods (Bottom). I: Incorporating an additional input modality, such as depth maps. II: An additional spatial encoder is incorporated into the model architecture to provide 3D information. III: Leveraging Reinforcement Learning to improve generalization. IV: The inference phase employs methods such as cognitive maps to perform structured reasoning. Representative methods for the four categories (Upper).
  • Figure 3: Taxonomy of visual spatial reasoning according to cognitive levels.
  • Figure 4: Task Settings for Basic Perception. Basic perception tasks are categorized into static and state attributes based on whether the attribute is subject to change.
  • Figure 5: Categorization of Spatial Understanding Tasks. Spatial understanding tasks are divided into static and dynamic understanding. Dynamic understanding tasks are characterized by viewpoint shifts or a temporal component.
  • ...and 11 more figures