HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing

Zixuan Bian, Ruohan Ren, Yue Yang, Chris Callison-Burch

TL;DR

Holodeck 2.0 presents a unified vision-language-guided pipeline for generating and editing diverse 3D scenes from text, addressing open-domain and style-consistent requirements. By chaining Scene Analysis, Object Generation, and Scene Generation with a DFS-based layout solver and interactive Scene Editing, it achieves high semantic fidelity, physical plausibility, and style coherence. Comprehensive human and automated evaluations demonstrate consistent superiority over baselines across indoor and open-domain scenarios, and practical integration with Unreal Engine highlights real-world applicability for game development. The work advances open-world 3D content creation by combining VLM-driven reasoning with high-quality 3D asset generation and editable, style-consistent layouts.

Abstract

3D scene generation plays a crucial role in gaming, artistic creation, virtual reality, and many other domains. However, current 3D scene design still relies heavily on extensive manual effort from creators, and existing automated methods struggle to generate open-domain scenes or support flexible editing. To address those challenges, we introduce HOLODECK 2.0, an advanced vision-language-guided framework for 3D world generation with support for interactive scene editing based on human feedback. HOLODECK 2.0 can generate diverse and stylistically rich 3D scenes (e.g., realistic, cartoon, anime, and cyberpunk styles) that exhibit high semantic fidelity to fine-grained input descriptions, suitable for both indoor and open-domain environments. HOLODECK 2.0 leverages vision-language models (VLMs) to identify and parse the objects required in a scene and generates corresponding high-quality assets via state-of-the-art 3D generative models. Then, HOLODECK 2.0 iteratively applies spatial constraints derived from the VLMs to achieve semantically coherent and physically plausible layouts. Both human and model evaluations demonstrate that HOLODECK 2.0 effectively generates high-quality scenes closely aligned with detailed textual descriptions, consistently outperforming baselines across indoor and open-domain scenarios. Additionally, HOLODECK 2.0 provides editing capabilities that flexibly adapt to human feedback, supporting layout refinement and style-consistent object edits. Finally, we present a practical application of HOLODECK 2.0 in procedural game modeling to generate visually rich and immersive environments that can boost efficiency in game design.
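The layout step described above (placing objects under spatial constraints, with the summary naming a DFS-based solver) can be illustrated with a small backtracking search. This is only a sketch under stated assumptions, not the authors' implementation: the integer grid, the `Box` type, the `near` relation, and the `solve_layout` function are all hypothetical names introduced here, and the real system derives its constraints from a VLM rather than from a hand-written set of pairs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

Position = Tuple[int, int]  # (x, y) of an object's lower-left corner on a grid


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned footprint of an object (grid cells)."""
    name: str
    w: int
    d: int


def overlaps(a: Box, pa: Position, b: Box, pb: Position) -> bool:
    """True if the two footprints share any area."""
    return (pa[0] < pb[0] + b.w and pb[0] < pa[0] + a.w
            and pa[1] < pb[1] + b.d and pb[1] < pa[1] + a.d)


def near(a: Box, pa: Position, b: Box, pb: Position, max_gap: int) -> bool:
    """True if the gap between the footprints is at most max_gap on each axis."""
    gap_x = max(pb[0] - (pa[0] + a.w), pa[0] - (pb[0] + b.w), 0)
    gap_y = max(pb[1] - (pa[1] + a.d), pa[1] - (pb[1] + b.d), 0)
    return max(gap_x, gap_y) <= max_gap


def solve_layout(boxes: List[Box], room: Tuple[int, int],
                 near_pairs: Set[Tuple[str, str]],
                 max_gap: int = 1) -> Optional[Dict[str, Position]]:
    """Depth-first search with backtracking over grid positions.

    Objects are placed one at a time; a candidate position is kept only if
    it overlaps nothing already placed and satisfies every 'near' pair
    whose other member is already placed. Returns None if no layout exists.
    """
    by_name = {b.name: b for b in boxes}
    placed: Dict[str, Position] = {}

    def consistent(box: Box, pos: Position) -> bool:
        for other_name, other_pos in placed.items():
            other = by_name[other_name]
            if overlaps(box, pos, other, other_pos):
                return False
            linked = ((box.name, other_name) in near_pairs
                      or (other_name, box.name) in near_pairs)
            if linked and not near(box, pos, other, other_pos, max_gap):
                return False
        return True

    def dfs(i: int) -> bool:
        if i == len(boxes):
            return True  # every object has a position
        box = boxes[i]
        for x in range(room[0] - box.w + 1):
            for y in range(room[1] - box.d + 1):
                if consistent(box, (x, y)):
                    placed[box.name] = (x, y)
                    if dfs(i + 1):
                        return True
                    del placed[box.name]  # backtrack and try the next cell
        return False

    return dict(placed) if dfs(0) else None
```

Calling `solve_layout([Box("table", 2, 2), Box("chair", 1, 1)], (4, 4), {("chair", "table")})` returns a dictionary of positions in which the chair sits within one cell of the table, while the same objects in a 2 by 2 room yield `None` because the table fills the floor.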

Paper Structure

This paper contains 31 sections, 2 equations, 17 figures, 3 tables, and 1 algorithm.

Figures (17)

  • Figure 1: Examples of stylistically varied 3D scenes generated by Holodeck 2.0. Captions are shortened versions of the original long inputs (50–300 words). Holodeck 2.0 can faithfully capture fine-grained details from textual descriptions.
  • Figure 1: In-game interface of the 3D scene generated by Holodeck 2.0.
  • Figure 2: Overview of Holodeck 2.0. Given a text input, Holodeck 2.0 generates 3D scenes via three modules: (1) Scene Analysis Module takes text as input and outputs object properties in JSON format, along with a reference image, individual object images, and a background image; (2) Object Generation Module takes individual object images as input and outputs textured 3D assets; (3) Scene Generation Module takes text, reference image, and object properties as input, and outputs the final 3D scene, with an optional editing module for human feedback.
  • Figure 2: Scenes generated by Holodeck 2.0 and the corresponding fine-grained input text.
  • Figure 3: Examples of the Scene Analysis Module and the Object Generation Module. Holodeck 2.0 can generate customized, stylistically diverse 3D objects that precisely match fine-grained textual descriptions.
  • ...and 12 more figures
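
The data flow in the Figure 2 overview caption (Scene Analysis, then Object Generation, then Scene Generation) can be sketched as three functions passing typed values along. Everything below is a placeholder written for illustration: the function names, the `SceneAnalysis` fields, the file names, and the stubbed return values are assumptions, and no VLM or 3D generative model is actually called. Passing the generated assets into the final stage is also an assumption, since the caption lists only text, reference image, and object properties as its inputs.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SceneAnalysis:
    """Hypothetical container for the Scene Analysis outputs named in the caption."""
    object_properties: str         # JSON string describing each object
    reference_image: str           # placeholder path to the reference image
    object_images: Dict[str, str]  # object name -> placeholder image path
    background_image: str          # placeholder path to the background image


def analyze_scene(text: str, object_names: List[str]) -> SceneAnalysis:
    # Stub: a real system would ask a VLM to extract objects from `text`;
    # here the object names are supplied by the caller.
    properties = [{"name": name} for name in object_names]
    return SceneAnalysis(
        object_properties=json.dumps(properties),
        reference_image="reference.png",
        object_images={name: f"{name}.png" for name in object_names},
        background_image="background.png",
    )


def generate_objects(object_images: Dict[str, str]) -> Dict[str, str]:
    # Stub: stands in for image-to-3D generation, one asset per object image.
    return {name: f"{name}.glb" for name in object_images}


def generate_scene(text: str, reference_image: str, object_properties: str,
                   assets: Dict[str, str]) -> dict:
    # Stub: attaches each asset to its parsed object entry; no layout is solved.
    names = [entry["name"] for entry in json.loads(object_properties)]
    return {
        "prompt": text,
        "reference": reference_image,
        "objects": [{"name": n, "asset": assets[n]} for n in names],
    }


analysis = analyze_scene("a cozy reading room", ["lamp", "sofa"])
assets = generate_objects(analysis.object_images)
scene = generate_scene("a cozy reading room", analysis.reference_image,
                       analysis.object_properties, assets)
```

The point of the sketch is the interface boundaries: object properties travel as JSON between stages, and the optional editing module mentioned in the caption would operate on the value returned by the final stage.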