SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence

Ziyang Gong; Wenhao Li; Oliver Ma; Songyuan Li; Zhaokai Wang; Songyuan Li; Jiayi Ji; Xue Yang; Gen Luo; Junchi Yan; Rongrong Ji

TL;DR

SpaCE-10 is a comprehensive benchmark for evaluating compositional spatial intelligence in multimodal systems, spanning atomic spatial capabilities through complex compositional reasoning. It introduces a hierarchical QA generation pipeline and collects 5k+ QA pairs from 811 real indoor scenes, enabling evaluation across 2D and 3D modalities. Extensive experiments on nearly 50 MLLMs show that humans still clearly outperform the models, identify counting as a key bottleneck, and find that 2D approaches generally outperform their 3D counterparts. The work points to counting-focused supervision and capability-aware training as promising ways to improve spatial intelligence, and offers the benchmark as a resource for the community.

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in various multimodal tasks. To pursue higher intelligence in space, MLLMs must integrate multiple spatial capabilities, even for simple, everyday tasks. However, existing benchmarks struggle to comprehensively evaluate the spatial intelligence of common MLLMs from the atomic level to the compositional level. To fill this gap, we present SpaCE-10, a comprehensive benchmark for compositional spatial evaluations. In SpaCE-10, we define 10 atomic spatial capabilities, which are combined to form 8 compositional capabilities. Based on these definitions, we propose a novel hierarchical annotation pipeline to generate high-quality and diverse question-answer (QA) pairs. With more than 150 hours of human expert effort, we obtain over 5k QA pairs for 811 real indoor scenes in SpaCE-10, covering various evaluation settings such as point cloud input and multi-choice QA. We conduct an extensive evaluation of common MLLMs on SpaCE-10 and find that even the most advanced MLLM still lags behind humans by a large margin. Through careful study, we also draw several significant findings that benefit the MLLM community. For example, we reveal that weak counting capability greatly limits the compositional spatial capabilities of existing MLLMs.
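To illustrate how multi-choice QA results could be broken down by capability, the following is a minimal sketch and not the benchmark's actual evaluation code. The record layout (`capabilities`, `answer`, `prediction`) and the label `object_recognition` are assumptions made for this example; only counting is named as a capability in the abstract.

```python
from collections import defaultdict


def accuracy_by_capability(records):
    """Return accuracy per capability for multi-choice QA records.

    Each record is a dict with hypothetical keys:
      'capabilities': list of capability labels the question exercises
      'answer':       ground-truth option letter, e.g. 'B'
      'prediction':   option letter produced by the model
    A question counts toward every capability it is tagged with.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        # Compare option letters, ignoring case and surrounding whitespace.
        hit = r["prediction"].strip().upper() == r["answer"].strip().upper()
        for cap in r["capabilities"]:
            total[cap] += 1
            correct[cap] += int(hit)
    return {cap: correct[cap] / total[cap] for cap in total}


# Toy records, invented purely for illustration.
records = [
    {"capabilities": ["counting", "object_recognition"], "answer": "B", "prediction": "b"},
    {"capabilities": ["counting"], "answer": "C", "prediction": "A"},
    {"capabilities": ["object_recognition"], "answer": "D", "prediction": "D "},
]
scores = accuracy_by_capability(records)
print(scores)
```

Grouping by capability in this way is what would make a bottleneck such as counting visible: a capability whose questions are answered correctly less often than the others stands out in the per-capability scores even when overall accuracy looks reasonable.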
