EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence

Xinjie Wang, Liu Liu, Yu Cao, Ruiqi Wu, Wenkang Qin, Dehui Wang, Wei Sui, Zhizhong Su

TL;DR

EmbodiedGen addresses the high cost and limited realism of existing embodied AI data by offering an end-to-end, open-source platform for generating interactive 3D worlds with real-world scale and physical properties. It combines Image-to-3D and Text-to-3D pipelines with texture, articulated-object, and scene generation, augmented by automated quality inspection and physics restoration to ensure simulator-ready URDF assets. Key innovations include a physics-aware image-to-3D pipeline with a GPT-4o/Qwen-based physics expert, GeoLifter for multi-view texture conditioning, DIPO-driven articulated object generation, and panorama-based scene construction with scale restoration. The framework enables real-to-sim digital twins and large-scale data augmentation across major simulators, accelerating embodied intelligence research and enabling scalable, realistic evaluation in diverse environments.

Abstract

Constructing a physically realistic and accurately scaled simulated 3D world is crucial for the training and evaluation of embodied intelligence tasks. The diversity, realism, accessibility, and affordability of 3D data assets are critical for achieving generalization and scalability in embodied AI. However, most current embodied intelligence tasks still rely heavily on traditional 3D computer graphics assets that are manually created and annotated, and these suffer from high production costs and limited realism. These limitations significantly hinder the scalability of data-driven approaches. We present EmbodiedGen, a foundational platform for interactive 3D world generation. It enables the scalable, low-cost generation of high-quality, controllable, and photorealistic 3D assets with accurate physical properties and real-world scale in the Unified Robot Description Format (URDF). These assets can be directly imported into various physics simulation engines for fine-grained physical control, supporting downstream tasks in training and evaluation. EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object Generation, Scene Generation, and Layout Generation. EmbodiedGen generates diverse and interactive 3D worlds composed of generative 3D assets, leveraging generative AI to address the generalization and evaluation challenges of embodied intelligence research. Code is available at https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html.
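To make "assets with physical properties and real-world scale in URDF" concrete, here is a minimal sketch of what a single-link URDF carrying a mass, an inertia tensor, and a mesh scale could look like. It is an illustration only, not EmbodiedGen's actual output schema: the helper name `build_single_link_urdf`, its parameters, and the solid-box inertia approximation are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def build_single_link_urdf(name, mesh_path, scale, mass_kg, box_size_m):
    """Return a URDF string for one rigid link.

    Hypothetical helper for illustration; it is not part of EmbodiedGen.
    `box_size_m` is the object's bounding box (x, y, z) in meters and is
    used only to approximate the inertia tensor as that of a solid box.
    """
    sx, sy, sz = box_size_m
    # Solid-box inertia about the center of mass (an assumed approximation).
    ixx = mass_kg * (sy**2 + sz**2) / 12.0
    iyy = mass_kg * (sx**2 + sz**2) / 12.0
    izz = mass_kg * (sx**2 + sy**2) / 12.0

    robot = ET.Element("robot", name=name)
    link = ET.SubElement(robot, "link", name=f"{name}_link")

    # Physical properties: mass in kilograms and a diagonal inertia tensor.
    inertial = ET.SubElement(link, "inertial")
    ET.SubElement(inertial, "mass", value=f"{mass_kg:.4f}")
    ET.SubElement(
        inertial, "inertia",
        ixx=f"{ixx:.6g}", ixy="0", ixz="0",
        iyy=f"{iyy:.6g}", iyz="0", izz=f"{izz:.6g}",
    )

    # The same mesh serves as visual and collision geometry; `scale` maps
    # mesh units to meters so the asset has real-world size.
    for tag in ("visual", "collision"):
        geometry = ET.SubElement(ET.SubElement(link, tag), "geometry")
        ET.SubElement(
            geometry, "mesh",
            filename=mesh_path, scale=f"{scale} {scale} {scale}",
        )

    return ET.tostring(robot, encoding="unicode")


if __name__ == "__main__":
    print(build_single_link_urdf("mug", "mug.obj", 0.1, 0.3, (0.08, 0.08, 0.10)))
```

A file of this shape can be loaded by simulators that accept URDF. Properties such as friction are typically expressed through simulator-specific extensions rather than core URDF tags, so they are left out of this sketch.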

Paper Structure

This paper contains 34 sections, 3 equations, 23 figures, 1 algorithm.

Figures (23)

  • Figure 1: EmbodiedGen, a toolkit for interactive 3D world generation for embodied intelligence. EmbodiedGen enables controllable generation of rigid and articulated assets with accurate real-world scale and physical properties, along with stylistically diverse background generation and visually rich texture generation and editing. These assets can be seamlessly integrated into various simulators such as OpenAI Gym, Isaac Lab, MuJoCo, and SAPIEN. These capabilities form a foundation for digital twinning, large-scale data augmentation, and embodied intelligence tasks such as manipulation and navigation across a wide range of simulation environments.
  • Figure 2: The framework of EmbodiedGen. It enables the creation of a digital twin within a simulation environment from a single image. Alternatively, given a task description, EmbodiedGen autonomously generates the scene layout, synthesizes detailed 3D object assets, and arranges them in semantically and physically plausible configurations. This facilitates the effortless construction of an interactive 3D world, supporting a wide range of embodied intelligence related research in diverse virtual environments.
  • Figure 3: Overview of the EmbodiedGen Image-to-3D pipeline. From a single image, the system generates mesh and 3DGS assets, conducts automatic quality inspection (aesthetics, segmentation, geometry), and regenerates failed outputs with automatically adjusted settings. A physics expert module restores real-world scale and physical semantics, and the assets are saved in URDF format.
  • Figure 4: AestheticChecker is used to evaluate the texture quality of generated assets. Assets displaying richer texture details receive higher scores.
  • Figure 5: Examples of segmentation failure cases automatically filtered by ImageSegChecker.
  • ...and 18 more figures
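
The generate, inspect, and regenerate loop described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function `generate_with_inspection`, the callables passed to it, and the retry limit are hypothetical names chosen for this example, and the real checkers (such as AestheticChecker and ImageSegChecker) are stood in for by plain Python predicates.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Settings = Dict[str, Any]


def generate_with_inspection(
    generate: Callable[[Settings], Any],
    checkers: Dict[str, Callable[[Any], bool]],
    adjust: Callable[[Settings, List[str]], Settings],
    settings: Settings,
    max_attempts: int = 3,
) -> Tuple[Optional[Any], int]:
    """Generate an asset, run every checker, and retry on failure.

    Returns (asset, attempts_used) on success, or (None, max_attempts)
    if no attempt passed all checkers. All names here are hypothetical.
    """
    for attempt in range(1, max_attempts + 1):
        asset = generate(settings)
        # Collect the names of every check the asset fails.
        failed = [name for name, check in checkers.items() if not check(asset)]
        if not failed:
            return asset, attempt
        # Let the caller derive new settings from the failed checks.
        settings = adjust(settings, failed)
    return None, max_attempts


if __name__ == "__main__":
    # Stand-in generator: quality improves as the (assumed) seed changes.
    def fake_generate(s: Settings) -> Dict[str, float]:
        return {"score": s["seed"] * 0.3}

    def bump_seed(s: Settings, failed: List[str]) -> Settings:
        return {**s, "seed": s["seed"] + 1}

    stub_checkers = {"aesthetics": lambda asset: asset["score"] >= 0.5}
    result = generate_with_inspection(
        fake_generate, stub_checkers, bump_seed, {"seed": 0}, max_attempts=5
    )
    print(result)
```

Passing the list of failed checks to `adjust` lets the retry policy depend on which inspection failed, for example changing segmentation settings only when the segmentation check rejects the asset.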