CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models

Yiqi Zhu, Ziyue Wang, Can Zhang, Peng Li, Yang Liu

TL;DR

CoSpace defines continuous space perception for Vision-Language Models (VLMs) and introduces a benchmark that evaluates how VLMs integrate spatial information across multiple views captured from a fixed viewpoint. It comprises seven tasks covering direction recognition, space grounding, angle-aware rotation, counting, and embodied planning, built from 2,918 images and 1,626 QA pairs collected from the Baidu Panorama API and HM3D. Evaluation across 19 models reveals gaps in consistency and rotation-based reasoning for open-source models, while proprietary models show higher consistency but still underperform in key areas. The work highlights the necessity of continuous spatial grounding for real-world tasks and provides a resource to drive progress in multi-image continuous-space perception.

Abstract

Vision-Language Models (VLMs) have recently made significant progress in visual comprehension. As the permitted length of image context grows, VLMs can now comprehend a broader range of views and spaces. Current benchmarks provide insightful analysis of VLMs in tasks involving complex visual instruction following, multi-image understanding, and spatial reasoning. However, they usually focus on spatially unrelated images or discrete images captured from varied viewpoints. The compositional characteristic of images captured from a static viewpoint remains underexplored. We term this characteristic Continuous Space Perception. Observing a scene from a static viewpoint while shifting orientation produces a series of spatially continuous images, enabling the reconstruction of the entire space. In this paper, we present CoSpace, a multi-image visual understanding benchmark designed to assess the continuous space perception ability of VLMs. CoSpace contains 2,918 images and 1,626 question-answer pairs, covering seven types of tasks. We evaluate 19 proprietary and open-source VLMs. Results reveal pitfalls in the continuous space perception ability of most of the evaluated models, including proprietary ones. Interestingly, we find that the main discrepancy between open-source and proprietary models lies not in accuracy but in the consistency of responses. We believe that enhancing continuous space perception is essential for VLMs to perform effectively in real-world tasks, and we encourage further research to advance this capability.

Paper Structure

This paper contains 27 sections, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Examples of CoSpace, along with prompt templates for evaluation. We also provide extra visual guidance and textual rationales for readers (invisible to models during evaluation) for easier understanding.
  • Figure 2: Distribution of categories and tasks in our CoSpace.
  • Figure 3: Variance in ACC$_q$ of the average on the Direction and Counting categories for eight selected models. Model abbreviations are used for simplicity. Accuracy in the figure represents the average of ACC$_q$ over two categories.
  • Figure 4: Case for the single-image pipeline. For illustration, we showcase all the images in the figure, but models can see only one image at a time. The responses in this case are all generated by MiniCPM-V 2.6.
  • Figure 5: Cases of generated rationales in the Rotation-Angle task.
  • ...and 6 more figures