Scaling Spatial Intelligence with Multimodal Foundation Models

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang

TL;DR

The paper addresses the limited spatial intelligence of multimodal foundation models with a data-centric scaling strategy. It introduces SenseNova-SI-8M, a corpus of eight million samples spanning five core spatial capabilities, and uses it to train the SenseNova-SI family on open backbones such as Qwen3-VL, InternVL3, and Bagel. The resulting models achieve state-of-the-art results on multiple spatial benchmarks while preserving general multimodal performance. Through extensive experiments, the authors analyze the impact of data scaling, early signs of emergent generalization, the risks of overfitting and language shortcuts, and a preliminary form of spatial chain-of-thought reasoning, and they validate a downstream application in embodied tasks. The work emphasizes dataset diversity and task coverage as drivers of spatial capability and releases the trained models publicly to support community progress in embodied and spatial reasoning. Overall, SenseNova-SI provides a scalable foundation for spatial intelligence in multimodal models and points to directions beyond data scaling for future gains.

Abstract

Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.

Paper Structure

This paper contains 50 sections, 8 figures, 13 tables, 1 algorithm.

Figures (8)

  • Figure 1: Guided by a taxonomy of spatial intelligence [cai2025has], we scaled spatial data to construct SenseNova-SI-8M, which we leverage to investigate the impact of data scaling on cultivating spatial capabilities in various MLLMs. The four subfigures at the corners detail SenseNova-SI's performance on four core spatial capabilities (i.e., Perspective-taking, Spatial Relations, Metric Measurement, and Comprehensive Reasoning). Through data scaling, SenseNova-SI surpasses open-source models and even outperforms GPT-5 in specific spatial abilities, such as Perspective-taking. The lines denote the average performance across benchmark subtasks within each capability, while the shaded regions (confidence bands) represent $\pm0.5$ standard deviation. At the center, we show that SenseNova-SI achieves state-of-the-art (SoTA) results on five recent spatial intelligence benchmarks (VSI, MMSI, MindCube, ViewSpatial, and SITE) while maintaining strong performance on a general multimodal benchmark (MMBench-En).
  • Figure 2: SenseNova-SI-8M reorganizes 4M open-source data samples and scales up with 4.5M additional samples, organized by fundamental spatial capabilities [cai2025has]. It covers general visual understanding (Non-SI), 2D grounding, and five core spatial abilities: Metric Measurement (MM), Spatial Relationship (SR), Perspective-Taking (PT), Mental Reconstruction (MR), and Comprehensive Reasoning (CR). Notably, SenseNova-SI-8M addresses the previously overlooked PT tasks. The top of the figure illustrates how data from each source is mapped to the core spatial capabilities (a scale in the upper-right corner indicates the number of QA pairs), while representative data samples are organized by core capability. The "Hugging Face" symbol indicates community datasets; the rest are curated for further scaling.
  • Figure 3: Observations on generalization ability from a single data source and a single task. The upper example demonstrates how training on an ego-exo association task enhances performance on tasks that require imagined first-person perspectives. The lower example demonstrates how a camera rotation task, based on cross-view visual correspondence, generalizes to tasks with distinct questions and visual appearances. These findings suggest the potential existence of meta-tasks in PT, which may enable related spatial capabilities.
  • Figure 4: Visualization of a manipulation task rollout in EmbodiedBench [yang2025embodiedbench], performed by the embodied agent powered by SenseNova-SI.
  • Figure 5: Hard cases in MessyTable [cai2020messytable], where multiple instances of the same object class are present in the same scene.
  • ...and 3 more figures
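
The Figure 1 caption describes each curve as the mean over benchmark subtasks within a capability, with a shaded band of ±0.5 standard deviation. A minimal sketch of that computation follows. It assumes the population standard deviation (the caption does not say which estimator is used), and the function name `capability_band` and the example scores are hypothetical, not taken from the paper.

```python
from statistics import mean, pstdev


def capability_band(subtask_scores, k=0.5):
    """Return (lower, center, upper) for one capability at one data scale.

    center is the mean over subtask scores; the band extends k standard
    deviations on each side. Using the population standard deviation is
    an assumption made for this sketch.
    """
    center = mean(subtask_scores)
    spread = k * pstdev(subtask_scores)
    return center - spread, center, center + spread


# Hypothetical subtask accuracies (%) for a single capability.
scores = [40.0, 50.0, 60.0]
lower, center, upper = capability_band(scores)
print(f"mean={center:.1f}, band=[{lower:.1f}, {upper:.1f}]")
```

With `k=0.5` the full width of the band equals one standard deviation, which makes it noticeably narrower than a conventional ±1 standard deviation band.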