Holodeck: Language Guided Generation of 3D Embodied AI Environments

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, Christopher Clark

TL;DR

Holodeck generates diverse, interactive 3D embodied environments by using GPT-4 to guide floor plans, materials, and object placement under spatial-relation constraints, building on AI2-THOR and Objaverse. The system's four modules (Floor/Wall, Doorway/Window, Object Selection, and Constraint-based Layout), together with DFS/MILP optimization, enable scalable, semantically coherent scenes for varied prompts and styles. Large-scale human evaluation shows that Holodeck outperforms procedurally generated baselines on residential scenes and generalizes to diverse environments, while zero-shot navigation experiments demonstrate improved transfer to novel scene types. By providing high-quality, promptable 3D worlds and efficient asset integration, Holodeck advances Embodied AI research toward generalizable scene understanding and navigation.

Abstract

3D simulated environments play a critical role in Embodied AI, but their creation requires expertise and extensive manual effort, restricting their diversity and scope. To mitigate this limitation, we present Holodeck, a system that generates 3D environments to match a user-supplied prompt fully automatically. Holodeck can generate diverse scenes, e.g., arcades, spas, and museums, adjust the designs for styles, and capture the semantics of complex queries such as "apartment for a researcher with a cat" and "office of a professor who is a fan of Star Wars". Holodeck leverages a large language model (i.e., GPT-4) for common sense knowledge about what the scene might look like and uses a large collection of 3D assets from Objaverse to populate the scene with diverse objects. To address the challenge of positioning objects correctly, we prompt GPT-4 to generate spatial relational constraints between objects and then optimize the layout to satisfy those constraints. Our large-scale human evaluation shows that annotators prefer Holodeck over manually designed procedural baselines in residential scenes and that Holodeck can produce high-quality outputs for diverse scene types. We also demonstrate an exciting application of Holodeck in Embodied AI, training agents to navigate in novel scenes like music rooms and daycares without human-constructed data, which is a significant step forward in developing general-purpose embodied agents.
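
To make the idea of optimizing a layout against spatial relational constraints concrete, the following is a minimal sketch of a depth-first search over grid positions. It is not the authors' implementation: the constraint names (`"edge"`, `"near"`), the tuple encoding, the function names, and the treatment of every constraint as hard are all assumptions made for illustration.

```python
from itertools import product

def overlaps(a, b):
    """True if two axis-aligned rectangles (x, y, w, h) intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def center_distance(a, b):
    """Manhattan distance between the centers of two rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return abs((ax + aw / 2) - (bx + bw / 2)) + abs((ay + ah / 2) - (by + bh / 2))

def holds(constraint, placed, room):
    """Check one hypothetical constraint; unplaced objects are not yet checked."""
    kind, *args = constraint
    if kind == "edge":  # object must touch a wall of the room
        (name,) = args
        if name not in placed:
            return True
        x, y, w, h = placed[name]
        return x == 0 or y == 0 or x + w == room[0] or y + h == room[1]
    if kind == "near":  # two objects must be within max_dist of each other
        a, b, max_dist = args
        if a not in placed or b not in placed:
            return True
        return center_distance(placed[a], placed[b]) <= max_dist
    raise ValueError(f"unknown constraint kind: {kind}")

def solve_layout(room, sizes, constraints):
    """Place objects one at a time, backtracking when a constraint fails.

    room: (width, height) in grid cells; sizes: name -> (w, h).
    Returns name -> (x, y, w, h), or None if no layout exists.
    """
    names = list(sizes)

    def dfs(i, placed):
        if i == len(names):
            return dict(placed)
        name = names[i]
        w, h = sizes[name]
        for x, y in product(range(room[0] - w + 1), range(room[1] - h + 1)):
            rect = (x, y, w, h)
            if any(overlaps(rect, other) for other in placed.values()):
                continue
            placed[name] = rect
            if all(holds(c, placed, room) for c in constraints):
                result = dfs(i + 1, placed)
                if result is not None:
                    return result
            del placed[name]  # backtrack and try the next position
        return None

    return dfs(0, {})

# Hypothetical constraints of the kind an LLM might be prompted to emit.
layout = solve_layout(
    room=(4, 3),
    sizes={"bed": (2, 2), "nightstand": (1, 1), "desk": (2, 1)},
    constraints=[("edge", "bed"), ("near", "nightstand", "bed", 2)],
)
print(layout)
```

Treating constraints as hard keeps the sketch short; a practical system would more likely score partial layouts so that a scene can still be produced when not every constraint can be satisfied.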

Paper Structure (22 sections, 1 equation, 27 figures, 2 tables)

Figures (27)

  • Figure 1: Example outputs of Holodeck, a large language model powered system that can generate diverse types of environments (arcade, spa, museum), customize for styles (Victorian-style), and understand fine-grained requirements ("has a cat", "fan of Star Wars").
  • Figure 2: Given a text input, Holodeck generates the 3D environment through multiple rounds of conversation with an LLM.
  • Figure 3: Floorplan Customizability. Holodeck can interpret complicated input and craft reasonable floor plans correspondingly.
  • Figure 4: Material Customizability. Holodeck can select appropriate floor and wall materials to make the scenes more realistic.
  • Figure 5: Door & Window Customizability. Holodeck can adjust the size, quantity, position, etc., of doors & windows based on the input.
  • ...and 22 more figures