ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models

Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang

TL;DR

This work identifies a major gap in vision-language models: strong egocentric spatial reasoning but weak cross-viewpoint (allocentric) spatial understanding. It introduces ViewSpatial-Bench, a first-of-its-kind benchmark with 5 tasks and 5,700+ samples from 1,338 scenes, built with an automated 3D orientation annotation pipeline using ScanNet and MS-CoCo sources. The authors show a substantial performance gap across current VLMs and propose the Multi-View Spatial Model (MVSM), trained on ~43K multi-perspective samples, achieving a 46.24% absolute improvement over baselines. MVSM also generalizes to embodied, real-world tasks (VSI-Bench and VSI-App), underscoring the value of perspective-aware training for spatial intelligence in embodied AI, while noting limitations in annotation scalability and environmental domain coverage.

Abstract

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We identify a critical limitation: current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints when required to adopt another entity's spatial frame of reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation across five distinct task types, supported by an automated 3D annotation pipeline that generates precise directional labels. Comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models demonstrate reasonable performance on camera-perspective tasks but exhibit reduced accuracy when reasoning from a human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach. Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances VLMs' corresponding spatial comprehension capabilities.
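
To make the egocentric/allocentric distinction concrete, the following minimal sketch classifies the same object from two frames of reference on a top-down 2D map. It is an illustration only, not the paper's annotation pipeline: the function name `relative_direction`, the four-way label set, and the coordinate convention (x to the right, y forward on the map) are all assumptions introduced here.

```python
import math


def relative_direction(observer_xy, facing_xy, target_xy):
    """Classify a target as front/back/left/right of an observer.

    Hypothetical helper for illustration. Coordinates are on a top-down
    map with x pointing right and y pointing up; `facing_xy` is the
    direction the observer looks in (any non-zero length).
    """
    fx, fy = facing_xy
    norm = math.hypot(fx, fy)
    if norm == 0:
        raise ValueError("facing_xy must be a non-zero vector")
    fx, fy = fx / norm, fy / norm

    # Offset from the observer to the target in map coordinates.
    dx = target_xy[0] - observer_xy[0]
    dy = target_xy[1] - observer_xy[1]

    # Project the offset onto the observer's forward axis (fx, fy)
    # and onto the observer's right-hand axis (fy, -fx).
    forward = dx * fx + dy * fy
    right = dx * fy - dy * fx

    # Pick the dominant axis; ties go to the forward/back axis.
    if abs(forward) >= abs(right):
        return "front" if forward > 0 else "back"
    return "right" if right > 0 else "left"


# Camera at the origin looking along +y; a person stands ahead of the
# camera and faces back toward it; an object sits off to one side.
camera_view = relative_direction((0, 0), (0, 1), (3, 1))
person_view = relative_direction((0, 1), (0, -1), (3, 1))
print(camera_view, person_view)  # right left
```

The object lies to the camera's right but to the person's left, because the person faces the camera. Answering from the person's viewpoint requires re-expressing the scene in that person's frame, which is the step the abstract reports current VLMs handle poorly.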

Paper Structure

This paper contains 41 sections, 9 figures, 6 tables, 2 algorithms.

Figures (9)

  • Figure 1: ViewSpatial-Bench for multi-perspective spatial reasoning. Our benchmark evaluates spatial localization capabilities from both camera and human perspectives across five task types.
  • Figure 2: ViewSpatial-Bench construction pipeline. From data collection to QA generation across camera-perspective and person-perspective tasks. The pipeline includes metadata creation, automatic filtering, spatial relation extraction, and manual verification.
  • Figure 3: Distribution of task categories in ViewSpatial-Bench, balanced between ScanNet-Source and CoCo-Source approaches, with five distinct subtasks for comprehensive evaluation of spatial reasoning across different viewpoints.
  • Figure 4: The image compares spatial reasoning performance between GPT-4o and MVSM on the VSI-App dataset, showing several examples where MVSM correctly answers perspective-taking questions about object locations, while GPT-4o makes errors when attempting to determine spatial relationships from another person's viewpoint.
  • Figure 5: Wordcloud of object categories.
  • ...and 4 more figures