EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, Xihui Liu, Cewu Lu, Dahua Lin, Jiangmiao Pang

TL;DR

EmbodiedScan addresses the need for holistic 3D scene understanding in embodied AI by introducing a large-scale, multi-modal, ego-centric dataset with real-scanned RGB-D data, dense semantic occupancy, oriented 3D boxes, and language prompts. It also provides Embodied Perceptron, a unified multi-modal architecture that fuses any number of views and modalities through an isomorphic multi-level fusion scheme and produces 3D detection, occupancy prediction, and language-grounded understanding via sparse and dense decoders. The work establishes two benchmark tracks, fundamental 3D perception and language-grounded 3D understanding, with extensive ablations and analyses that demonstrate strong cross-view performance and highlight the remaining challenges in orientation estimation, domain gaps, and large-vocabulary grounding. Overall, EmbodiedScan offers a data-rich, language-aware platform for practical, perception-driven embodied AI in real-world indoor environments.

Abstract

In the realm of computer vision and robotics, embodied agents are expected to explore their environment and carry out human instructions. This necessitates the ability to fully understand 3D scenes given their first-person observations and contextualize them into language for interaction. However, traditional research focuses more on scene-level input and output setups from a global view. To address the gap, we introduce EmbodiedScan, a multi-modal, ego-centric 3D perception dataset and benchmark for holistic 3D scene understanding. It encompasses over 5k scans encapsulating 1M ego-centric RGB-D views, 1M language prompts, 160k 3D-oriented boxes spanning over 760 categories, some of which partially align with LVIS, and dense semantic occupancy with 80 common categories. Building upon this database, we introduce a baseline framework named Embodied Perceptron. It is capable of processing an arbitrary number of multi-modal inputs and demonstrates remarkable 3D perception capabilities, both within the two series of benchmarks we set up, i.e., fundamental 3D perception tasks and language-grounded tasks, and in the wild. Codes, datasets, and benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.
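
As a rough illustration of what combining an arbitrary number of ego-centric RGB-D views involves, the sketch below back-projects each depth map through a pinhole camera model and moves the resulting points into a shared world frame using that view's pose. This is a generic geometric step, not the Embodied Perceptron implementation; the function names (`backproject_view`, `fuse_views`), the dictionary layout of a view, and the camera parameters are assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Matrix = Sequence[Sequence[float]]


def backproject_view(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world: Matrix,
) -> List[Point]:
    """Lift one depth map (metres, values <= 0 mean missing) to world-frame points.

    Hypothetical helper: assumes a pinhole camera and a 4x4 camera-to-world pose.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels without a valid depth reading
            # Pinhole back-projection from pixel (u, v) to camera coordinates.
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Apply the top three rows of the homogeneous pose matrix.
            world = tuple(
                sum(cam_to_world[r][c] * cam[c] for c in range(4))
                for r in range(3)
            )
            points.append(world)
    return points


def fuse_views(views: Sequence[Dict]) -> List[Point]:
    """Concatenate world-frame points from any number of posed depth views."""
    fused: List[Point] = []
    for view in views:
        fused.extend(backproject_view(**view))
    return fused
```

Because every view is mapped into the same world frame before concatenation, `fuse_views` works unchanged for one view or for hundreds; a learned model would typically attach image or text features to these points rather than use raw coordinates alone.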

Paper Structure (31 sections, 6 equations, 8 figures, 14 tables)

Figures (8)

  • Figure 1: Comparison with other 3D indoor scene datasets. "Cats" refers to the categories with box annotations for the 3D detection benchmark. EmbodiedScan features more than $10\times$ categories, prompts, and the most diverse annotations. The numbers are still scaling up with further annotations. Mono./Syn./Lang. means Monocular/Synthetic/Language.
  • Figure 2: Dataset composition. EmbodiedScan is composed of three data sources and has similar scans, images, objects, and categories in each of them.
  • Figure 3: EmbodiedScan annotation and statistics. (a) UI for 3D box annotation. We select keyframes and generate their SAM masks with corresponding axis-aligned boxes. With simple clicks, annotators can create 3D boxes for target objects and further adjust them with reference to three orthogonal views and images. (b) Box statistics, showing that small boxes ($<1\,\mathrm{m}^3$) increase more, and prompt statistics. objs/avg./des. refer to objects/average/descriptions. (c) We show the number of instances per category (300 classes). For categories that exist in ScanNet, we plot the absolute increase and observe a significant improvement. (d) We plot the occupancy distribution for each category and see a different word cloud distribution. These two clouds show different aspects of this dataset: occupied space vs. number of instances.
  • Figure 4: Embodied Perceptron accepts an RGB-D sequence with any number of views, along with text, as multi-modal input. It uses classical encoders to extract features for each modality and adopts dense and isomorphic sparse fusion with corresponding decoders for different predictions. The 3D features integrated with the text feature can be further used for language-grounded understanding.
  • Figure 5: Complete instance distribution of EmbodiedScan.
  • ...and 3 more figures