Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis

Zhi Helu, Huang Jingjing, Xu Wang, Xu Yangbin, Zhang Wanyue, Jiang Baoyang, Deng Shirui, Zhu Liang, Li Fangfang, Zhao Tiejun, Lin Yankai, Yao Yuan

TL;DR

SPRITE tackles the data bottleneck in spatial reasoning for vision-language models by programmatically synthesizing ground truth through executable code. It uses simulators to collect rich scene meta-information, and LLMs to generate both diverse questions and the code that computes precise ground-truth answers. A rigorous quality-control (QC) process ensures correctness and consistency, enabling the large-scale SPRITE-300K dataset, which improves performance on multiple spatial benchmarks and generalizes across model architectures. The results indicate that increasing scene diversity is crucial for robust spatial intelligence: models trained on SPRITE data outperform those trained on template-based datasets, and the pipeline enables scalable research in embodied AI.

Abstract

Embodied intelligence, a grand challenge in artificial intelligence, is fundamentally constrained by the limited spatial understanding and reasoning capabilities of current models. Prevailing efforts to address this through enhancing Vision-Language Models (VLMs) are trapped in a dilemma: template-based datasets are scalable but structurally rigid, while manual annotation is linguistically diverse but unscalable and, critically, computationally imprecise. We introduce SPRITE, a novel framework that overcomes this dilemma by leveraging simulators and large models to programmatically synthesize scalable, diverse, and high-quality spatial reasoning data. The core innovation of SPRITE is to reframe ground-truth generation as a code-generation task. We utilize LLMs to compile complex spatial questions into executable programs, which are then verified against high-precision scene meta-information extracted from simulators. This ensures our ground truth is both computationally precise and verifiable, while the generative power of LLMs provides vast linguistic diversity. Leveraging this pipeline, we have curated a dataset encompassing 3 simulators, 11k+ scenes, and 300k+ image/video instruction-tuning pairs. We demonstrate that a VLM trained on our data achieves significant performance gains on multiple spatial benchmarks and outperforms other open-source datasets of equivalent size. Furthermore, a scalability analysis confirms our hypothesis that overcoming the low-diversity nature of traditional template methods is essential for building robust, generalizable spatial intelligence. We will make the SPRITE framework code and the full 300k+ dataset publicly available to facilitate future research in spatial intelligence.
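The idea of treating ground-truth generation as code generation can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the metadata layout (`SCENE_META`), the `answer(meta)` convention, and the `run_program` helper are hypothetical and are not taken from the SPRITE codebase. The string `GENERATED_PROGRAM` stands in for code an LLM might emit for the question "Which is closer to the camera, the chair or the table?".

```python
import math

# Hypothetical scene meta-information, as a simulator might export it.
# Field names and values are illustrative assumptions, not SPRITE's schema.
SCENE_META = {
    "camera": {"position": (0.0, 1.5, 0.0)},
    "objects": {
        "chair": {"position": (1.0, 0.5, 2.0)},
        "table": {"position": (3.0, 0.5, 4.0)},
    },
}

# Stand-in for LLM-generated code answering:
# "Which is closer to the camera, the chair or the table?"
GENERATED_PROGRAM = '''
def answer(meta):
    cam = meta["camera"]["position"]
    dists = {
        name: math.dist(cam, meta["objects"][name]["position"])
        for name in ("chair", "table")
    }
    return min(dists, key=dists.get)
'''


def run_program(source, meta):
    """Execute a generated program against scene metadata.

    Returns the program's answer, or None if the program fails to
    define `answer` or raises while running.
    """
    namespace = {"math": math}
    try:
        exec(source, namespace)  # defines `answer` inside `namespace`
        return namespace["answer"](meta)
    except Exception:
        return None  # in this sketch, a failing program is simply rejected


ground_truth = run_program(GENERATED_PROGRAM, SCENE_META)
print(ground_truth)  # chair
```

Because the answer is computed from simulator coordinates rather than estimated by an annotator, it is exact and can be re-derived at any time by re-running the program. Note that `exec` on untrusted model output is unsafe; a real pipeline would presumably run such programs in a sandbox.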

Paper Structure

This paper contains 21 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of SPRITE. Our dataset contains 210k+ image-centered samples and 90k+ video-centered samples. Models trained on our data show improved performance across multiple spatial understanding benchmarks.
  • Figure 2: Data generation approach. Our approach consists of five main parts: (1) Data collection, (2) Reference Generation, (3) Leveraging the collected data to generate diverse questions, (4) Obtaining ground truth from the executable code that processes meta-information.
  • Figure 3: Spatial task framework. The task framework includes four types of tasks: video-based questions, image-based questions, navigation problems, and compound problems. Among them, the image-based questions contain 18 subclasses, and the video-based questions contain 16 subclasses.
  • Figure 4: Compound question example.
  • Figure 5: Comprehensive analysis of data scaling and model performance. (a) Scaling laws with respect to the scene scale ($k$). (b) Performance comparison between our method and the template across varying numbers of training samples ($w$). (c) Ablation study on different data modalities (image, video, and hybrid) for the number of training samples ($w$).
  • ...and 1 more figure