MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

Ruiyuan Lyu; Jingli Lin; Tai Wang; Shuai Yang; Xiaohan Mao; Yilun Chen; Runsen Xu; Haifeng Huang; Chenming Zhu; Dahua Lin; Jiangmiao Pang

TL;DR

MMScan introduces the largest multi-modal 3D scene dataset with hierarchical grounded language annotations, built via a top-down, VLM-assisted, human-in-the-loop pipeline on real-scanned data. It provides meta-annotations that enable scalable generation of visual grounding, QA, and grounding-enabled captions, along with benchmarks to assess and train 3D-LLMs. Experimental results show that MMScan poses new challenges, but also enables substantial performance gains for 3D grounding and language-model alignment through instruction tuning and data-driven training. The dataset facilitates advances in training robust 3D-LLMs and highlights the importance of scene-level, hierarchical grounding for real-world embodied AI tasks.

Abstract

With the emergence of LLMs and their integration with other data modalities, multi-modal 3D perception has attracted growing attention because of its connection to the physical world, and it is progressing rapidly. However, limited by existing datasets, previous works mainly focus on understanding object properties or inter-object spatial relationships in a 3D scene. To tackle this problem, this paper builds MMScan, the largest ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations. It is constructed with a top-down logic, from region to object level and from a single target to inter-target relationships, covering holistic aspects of spatial and attribute understanding. The overall pipeline uses powerful VLMs with carefully designed prompts to initialize the annotations efficiently, and then involves human correction in the loop to ensure the annotations are natural, correct, and comprehensive. Built upon existing 3D scanning data, the resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions, as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks. We evaluate representative baselines on our benchmarks, analyze their capabilities in different aspects, and showcase the key problems to be addressed in the future. Furthermore, we use this high-quality dataset to train state-of-the-art 3D visual grounding models and LLMs, and obtain remarkable performance improvements both on existing benchmarks and in in-the-wild evaluation. Code, datasets, and benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.
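To make the region-to-object hierarchy described in the abstract concrete, the sketch below shows one way meta-annotations could be stored and expanded into visual grounding samples. It is only an illustration under stated assumptions: the class names, field names, aspect keys, and the `grounding_samples` helper are hypothetical and are not taken from the MMScan codebase or its released data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ObjectAnnotation:
    """Hypothetical object-level meta-annotation: one caption per aspect."""
    object_id: int
    category: str
    captions: Dict[str, str] = field(default_factory=dict)


@dataclass
class RegionAnnotation:
    """Hypothetical region-level meta-annotation that owns its objects."""
    region_id: int
    region_type: str
    captions: Dict[str, str] = field(default_factory=dict)
    objects: List[ObjectAnnotation] = field(default_factory=list)


def grounding_samples(region: RegionAnnotation) -> List[dict]:
    """Expand every object caption into one (text, target) grounding sample."""
    samples = []
    for obj in region.objects:
        for aspect, text in obj.captions.items():
            samples.append({
                "text": f"Find the {obj.category} in the {region.region_type}: {text}",
                "target_ids": [obj.object_id],
                "aspect": aspect,
            })
    return samples


# Illustrative values only; they do not come from the dataset.
region = RegionAnnotation(
    region_id=0,
    region_type="living region",
    objects=[
        ObjectAnnotation(1, "sofa", {"appearance": "a grey fabric sofa",
                                     "placement": "against the wall"}),
        ObjectAnnotation(2, "lamp", {"appearance": "a tall black floor lamp"}),
    ],
)
samples = grounding_samples(region)
```

Keeping captions keyed by aspect at each level means a single meta-annotation can be reused across several downstream sample types, which is the property the abstract attributes to the top-down design.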

Paper Structure (35 sections, 1 equation, 17 figures, 15 tables)

Introduction
Related Work
Dataset
Data Collection & Processing
Meta-annotations
Post-processing
Analysis
Experiments
3D Visual Grounding Benchmark
3D Question Answering Benchmark
3D Captioning Benchmark
Analysis
Limitations and Conclusion
Demo Video
Annotation Details
...and 20 more sections

Figures (17)

Figure 1: MMScan provides the largest ever multi-modal 3D scene dataset with 6.9M hierarchical grounded language annotations, covering holistic aspects at both the object and region levels.
Figure 2: Object-level (top) and region-level (bottom) meta-annotation UI, pipeline, and examples.
Figure 3: Post-processed annotations for benchmarks. "O" and "R" mean "objects" and "regions". Apart from the samples shown in the figure, a small share of QA samples (2.18%) targets advanced understanding and reasoning, such as situated QA related to everyday life.
Figure 4: The performance of both tasks grows steadily with the increase of training data, and more diverse scenes can result in more significant improvement.
Figure 5: Visual prompts for object-level meta-annotation. After view selection, the images are cropped to the projected area within the object's bounding box, leading to images of different sizes.
...and 12 more figures
