SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence

Ziyang Gong, Wenhao Li, Oliver Ma, Songyuan Li, Zhaokai Wang, Jiayi Ji, Xue Yang, Gen Luo, Junchi Yan, Rongrong Ji

TL;DR

SpaCE-10 presents a comprehensive benchmark for evaluating compositional spatial intelligence in multimodal systems, bridging atomic spatial capabilities to complex compositional reasoning. It introduces a hierarchical QA generation pipeline and collects 5k+ QA pairs from 811 real indoor scenes, enabling evaluation across 2D and 3D modalities. Extensive experiments on nearly 50 MLLMs reveal human superiority and identify counting as a key bottleneck, with 2D approaches generally outperforming 3D counterparts. The work highlights effective pathways for improving spatial intelligence through counting-focused supervision and capability-aware training, providing a valuable resource for the community.

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in various multimodal tasks. To pursue higher intelligence in space, MLLMs need to integrate multiple spatial capabilities, even for simple, everyday tasks. However, existing benchmarks struggle to comprehensively evaluate the spatial intelligence of common MLLMs from the atomic level to the compositional level. To fill this gap, we present SpaCE-10, a comprehensive benchmark for compositional spatial evaluations. In SpaCE-10, we define 10 atomic spatial capabilities, which are combined to form 8 compositional capabilities. Based on these definitions, we propose a novel hierarchical annotation pipeline to generate high-quality and diverse question-answer (QA) pairs. With over 150 hours of human expert effort, we obtain over 5k QA pairs for 811 real indoor scenes in SpaCE-10, covering various evaluation settings such as point cloud input and multi-choice QA. We conduct an extensive evaluation of common MLLMs on SpaCE-10 and find that even the most advanced MLLM still lags behind humans by a large margin. Through careful study, we also draw several significant findings that benefit the MLLM community. For example, we reveal that weak counting capability greatly limits the compositional spatial capabilities of existing MLLMs.

Paper Structure

This paper contains 27 sections, 2 equations, 22 figures, 7 tables.

Figures (22)

  • Figure 1: Overview of the SpaCE-10 benchmark. SpaCE-10 takes over 150 human expert hours to collect 5k+ QA pairs in 811 indoor scenes, which can evaluate MLLMs from 10 atomic capabilities to 8 compositional capabilities. Through evaluations, SpaCE-10 indicates that even the most advanced MLLM still lags behind humans by a large margin. The green circle marks the correct answer.
  • Figure 2: Dataset analysis of SpaCE-10. (a) Number distribution of each QA type. SpaCE-10 consists of 8 QA types that are EQ (Entity Quantification), SQ (Scene Quantification), SA (Size Assessment), OO (Object-Object Spatial Relationship), OS (Object-Scene Spatial Relationship), EP (Entity Presence), FR (Functional Reasoning), and SP (Spatial Planning). (b) Average vocabulary size per QA type for question, option, and average. (c) Average character length per QA type. (d) Coverage of the atomic capabilities (C1-C10). (e) The correlation between human expert accuracy and average character length across six QA types. (f) Capability co-occurrence matrix.
  • Figure 3: Illustration of our hierarchical annotation pipeline. We generate structural data to construct over 10k QA pairs, and perform capability integration to obtain over 5k QA pairs with 10 compositional capabilities. This process takes over 150 expert hours for data collection and filtering.
  • Figure 4: Results of representative MLLMs on 10 atomic capabilities of SpaCE-10. Each value reflects the model's average accuracy (%) across all question types involving the respective spatial capability (C1-C10), as defined in the benchmark's task-to-capability mapping.
  • Figure 5: Examples of all QA types. The blue examples represent perception QA, and the purple ones denote reasoning QA. The green circles mark the correct answers.
  • ...and 17 more figures
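
The per-capability scores described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes each capability score is the unweighted mean of the per-QA-type accuracies for the types that involve that capability. The mapping `TASK_TO_CAPS`, the function name `capability_accuracy`, and the accuracy values are hypothetical placeholders, not the benchmark's actual mapping or results.

```python
from statistics import mean

# Hypothetical task-to-capability mapping (illustrative only; the real
# mapping between QA types and capabilities C1-C10 is defined in the paper).
TASK_TO_CAPS = {
    "EQ": {"C1", "C2"},
    "SA": {"C2", "C3"},
    "SP": {"C1", "C3"},
}


def capability_accuracy(type_accuracy, task_to_caps):
    """Average per-type accuracy (%) over the QA types involving each capability."""
    per_cap = {}
    for qa_type, caps in task_to_caps.items():
        if qa_type not in type_accuracy:
            continue  # skip QA types the model was not evaluated on
        for cap in caps:
            per_cap.setdefault(cap, []).append(type_accuracy[qa_type])
    # Sort by capability name so the output order is stable.
    return {cap: mean(vals) for cap, vals in sorted(per_cap.items())}


# Made-up per-type accuracies for a single model.
scores = capability_accuracy({"EQ": 40.0, "SA": 60.0, "SP": 50.0}, TASK_TO_CAPS)
print(scores)
```

In this toy setup, a low score on one QA type pulls down every capability mapped to it. Weighting each QA type by its number of questions would be an equally plausible reading of the caption.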