InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models

Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen, Songze Li, Haomin Wang, Xingguang Wei, Tianshuo Yang, Min Dou, Tong He, Wenqi Shao, Kaipeng Zhang, Yi Wang, Botian Shi, Yanting Zhang, Jifeng Dai, Yu Qiao, Hongjie Zhang, Wenhai Wang

TL;DR

This work tackles the limited spatial reasoning capabilities of vision-language models by introducing InternSpatial, a large-scale open-source dataset with 12 million QA pairs and 19 instruction formats, spanning single-view and multi-view contexts. A dedicated InternSpatial-Bench benchmark, including a rotation angle prediction task for multi-view evaluation, provides a comprehensive diagnostic suite. The authors develop a modular data engine to automatically generate annotations, align to canonical view space, and synthesize template-based QA, enabling scalable and diverse QA generation. Experimental results show that fine-tuning VLMs on InternSpatial yields substantial gains on spatial reasoning benchmarks while preserving performance on general multimodal tasks, illustrating the approach's practical impact for robotics and embodied AI.
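As a rough illustration of the last stage of such a data engine (template-based QA synthesis), the sketch below turns two 2D bounding-box annotations into a left/right question-answer pair. It is not the authors' implementation: the `Box` type, the `TEMPLATES` wording, the `left_of_qa` helper, and the center-based left/right rule are all assumptions made for this example, and the canonical-view alignment step is omitted.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical 2D box annotation in pixel coordinates."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center_x(self) -> float:
        # Horizontal center of the box.
        return (self.x_min + self.x_max) / 2.0


# Hypothetical question templates; the dataset's real templates are not shown here.
TEMPLATES = [
    "Is the {a} to the left of the {b}?",
    "Does the {a} appear on the left side of the {b}?",
]


def left_of_qa(a: Box, b: Box, template_index: int = 0) -> dict:
    """Fill one template and derive the answer from the box centers (assumed rule)."""
    template = TEMPLATES[template_index % len(TEMPLATES)]
    question = template.format(a=a.label, b=b.label)
    answer = "yes" if a.center_x < b.center_x else "no"
    return {"question": question, "answer": answer}


chair = Box("chair", 10, 40, 60, 120)
table = Box("table", 100, 30, 220, 130)
qa = left_of_qa(chair, table)
print(qa["question"], qa["answer"])
```

Swapping the template index changes only the phrasing of the question, not the answer, which is one simple way to produce several query styles from a single annotation.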

Abstract

Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain limited in scale, visual diversity, and instruction expressiveness. In this work, we introduce InternSpatial, the largest open-source dataset for spatial reasoning in VLMs, along with InternSpatial-Bench, a corresponding evaluation benchmark designed to assess spatial understanding under diverse instruction formats. InternSpatial comprises 12 million QA pairs spanning both single-view and multi-view settings, drawn from diverse visual environments and supporting 19 instruction formats that reflect varied query styles. For evaluation, we propose InternSpatial-Bench for single-view tasks and expand multi-view reasoning by introducing a novel rotation angle prediction task that has not been explored in prior work. Experimental results show that models trained on InternSpatial achieve a 12.1% improvement on InternSpatial-Bench and a 10.7% improvement on VSI-Bench, while maintaining strong performance on general-purpose benchmarks. We hope these resources will support the development of spatially capable VLMs in practical applications such as robotics and embodied AI.

Paper Structure

This paper contains 33 sections, 25 figures, and 8 tables.

Figures (25)

  • Figure 1: Generation pipeline for InternSpatial. The optional flows (represented by dashed lines and boxes) are performed only when the relevant annotations do not exist in the data source.
  • Figure 2: Examples of diverse instruction formats in text and image. The four images illustrate different visual formats: original (top-left), bounding boxes (top-right), segmentation masks (bottom-left), and numbered regions (bottom-right). Surrounding the images are seven corresponding text instruction formats. The color blocks beside each image indicate whether the corresponding image-text pair is included in InternSpatial and InternSpatial-Bench. Best viewed in color.
  • Figure 3: Distribution of instruction formats (Left) and data sources (Right) in InternSpatial.
  • Figure 4: Distribution of instruction formats (Left) and data sources (Right) in InternSpatial-Bench.
  • Figure 5: The results of the different image (Left) and text (Right) formats in the ablation study.
  • ...and 20 more figures