CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
Yiqi Zhu, Ziyue Wang, Can Zhang, Peng Li, Yang Liu
TL;DR
CoSpace defines continuous space perception for Vision-Language Models (VLMs) and introduces a benchmark that evaluates how VLMs integrate spatial information across multiple views captured from a fixed viewpoint. It spans seven tasks covering direction recognition, space grounding, angle-aware rotation, counting, and embodied planning, using 2,918 images and 1,626 QA pairs collected from the Baidu Panorama API and HM3D. Evaluation across 19 models reveals gaps in consistency and rotation-based reasoning for open-source models, while proprietary models answer more consistently but still underperform in key areas. The work highlights the necessity of continuous spatial grounding for real-world tasks and provides a resource to drive progress in multi-image continuous-space perception.
Abstract
Vision-Language Models (VLMs) have recently made significant progress in visual comprehension. As the permitted length of image context grows, VLMs can now comprehend a broader range of views and spaces. Current benchmarks provide insightful analysis of VLMs on tasks involving complex visual instruction following, multi-image understanding, and spatial reasoning. However, they usually focus on spatially irrelevant images or on discrete images captured from varied viewpoints. The compositional characteristic of images captured from a static viewpoint remains underexplored. We term this characteristic Continuous Space Perception. Observing a scene from a static viewpoint while shifting orientation produces a series of spatially continuous images, enabling the reconstruction of the entire space. In this paper, we present CoSpace, a multi-image visual understanding benchmark designed to assess the continuous space perception ability of VLMs. CoSpace contains 2,918 images and 1,626 question-answer pairs, covering seven types of tasks. We evaluate 19 proprietary and open-source VLMs. Results reveal that most of the evaluated models, including proprietary ones, have shortcomings in continuous space perception. Interestingly, we find that the main discrepancy between open-source and proprietary models lies not in accuracy but in the consistency of responses. We believe that enhancing continuous space perception is essential for VLMs to perform effectively in real-world tasks, and we encourage further research to advance this capability.
