InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts

Weipeng Zhong, Peizhou Cao, Yichen Jin, Li Luo, Wenzhe Cai, Jingli Lin, Hanqing Wang, Zhaoyang Lyu, Tai Wang, Bo Dai, Xudong Xu, Jiangmiao Pang

TL;DR

InternScenes addresses the need for large-scale, simulatable indoor scenes with realistic layouts and interactive objects to advance Embodied AI. It fuses real-world scans, procedurally generated content, and designer-created scenes into three sub-datasets (Real2Sim, Gen, Synthetic), totaling around 40k scenes and 1.96M objects across 288 classes. A physics-aware pipeline that combines bounding-box optimization, convex decomposition, and SAPIEN simulation yields collision-free, physically plausible layouts while preserving dense small-item distributions. The dataset supports benchmarks for interior scene generation and point-goal navigation, and the authors intend to open-source it to accelerate research in embodied AI and AIGC, from simulation through to real-world deployment.

Abstract

The advancement of Embodied AI heavily relies on large-scale, simulatable 3D scene datasets characterized by scene diversity and realistic layouts. However, existing datasets typically suffer from limitations in data scale or diversity, sanitized layouts lacking small items, and severe object collisions. To address these shortcomings, we introduce InternScenes, a novel large-scale simulatable indoor scene dataset comprising approximately 40,000 diverse scenes, built by integrating three disparate scene sources: real-world scans, procedurally generated scenes, and designer-created scenes. It includes 1.96M 3D objects and covers 15 common scene types and 288 object classes. We deliberately preserve the large number of small items in the scenes, resulting in realistic and complex layouts with an average of 41.5 objects per region. Our comprehensive data processing pipeline ensures simulatability by creating real-to-sim replicas for real-world scans, enhances interactivity by incorporating interactive objects into these scenes, and resolves object collisions through physical simulation. We demonstrate the value of InternScenes with two benchmark applications: scene layout generation and point-goal navigation. Both reveal new challenges posed by the complex and realistic layouts. More importantly, InternScenes paves the way for scaling up model training for both tasks, making generation and navigation in such complex scenes possible. We commit to open-sourcing the data, models, and benchmarks to benefit the whole community.
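To make "object collisions" concrete, the following is a minimal Python sketch of how interpenetrating objects could be flagged at the bounding-box level. It is an illustration only and not the authors' pipeline, which the summary describes as combining bounding-box optimization, convex decomposition, and SAPIEN simulation. The `Box` class, the function names, the tolerance value, and the example coordinates are all hypothetical, and the boxes are assumed to be axis-aligned.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical structure, not the dataset's format)."""
    name: str
    min_xyz: tuple[float, float, float]
    max_xyz: tuple[float, float, float]


def penetration(a: Box, b: Box) -> float:
    """Return the smallest overlap depth over the three axes, or 0.0 if separated."""
    depths = []
    for lo_a, hi_a, lo_b, hi_b in zip(a.min_xyz, a.max_xyz, b.min_xyz, b.max_xyz):
        depth = min(hi_a, hi_b) - max(lo_a, lo_b)
        if depth <= 0.0:
            # A gap (or mere contact) on any single axis means no interpenetration.
            return 0.0
        depths.append(depth)
    return min(depths)


def colliding_pairs(boxes: list[Box], tol: float = 1e-3) -> list[tuple[str, str]]:
    """List name pairs whose boxes interpenetrate by more than `tol` (assumed metres)."""
    return [
        (a.name, b.name)
        for a, b in combinations(boxes, 2)
        if penetration(a, b) > tol
    ]


# Hypothetical region: a cup sunk 5 cm into a table top, and a chair standing apart.
scene = [
    Box("table", (0.0, 0.0, 0.0), (1.0, 1.0, 0.8)),
    Box("cup", (0.4, 0.4, 0.75), (0.5, 0.5, 0.9)),
    Box("chair", (2.0, 0.0, 0.0), (2.5, 0.5, 1.0)),
]
print(colliding_pairs(scene))
```

A box-level test like this is only a coarse filter: a cup resting inside the bounding box of an open shelf would be flagged even though the meshes do not touch, which is one reason finer geometry such as convex decomposition is useful when small items are densely placed.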

Paper Structure

This paper contains 20 sections, 5 equations, 19 figures, 8 tables, 1 algorithm.

Figures (19)

  • Figure 1: InternScenes is a large-scale, simulatable indoor scene dataset with diverse layouts and various 3D objects. It supports various tasks, such as scene layout generation and vision navigation.
  • Figure 2: Pipeline for retrieving synthetic scenes from real scan scenes.
  • Figure 3: Pipeline for annotating and processing raw scenes to extract precise layout information.
  • Figure 4: Examples from InternScenes-Real2Sim. Each scene shows its BEV map as well as one isometric view.
  • Figure 5: Examples from InternScenes-Gen. The BEV map and one isometric view are shown.
  • ...and 14 more figures