SceneWiz3D: Towards Text-guided 3D Scene Composition

Qihang Zhang, Chaoyang Wang, Aliaksandr Siarohin, Peiye Zhuang, Yinghao Xu, Ceyuan Yang, Dahua Lin, Bolei Zhou, Sergey Tulyakov, Hsin-Ying Lee

TL;DR

SceneWiz3D tackles the challenge of text-to-3D scene synthesis by introducing a hybrid explicit-implicit representation that treats objects of interest with explicit DMTet meshes and the background with an implicit NeRF. It simultaneously optimizes object layouts using Particle Swarm Optimization and guides both perspective and panoramic views with diffusion priors (SDS/VSD for perspective and LDM3D-pano for panoramas) plus a MiDaS-based depth regularizer. The method achieves state-of-the-art results in both appearance and geometry across diverse indoor scenes and supports user-provided or text-generated objects, demonstrating strong generalization and scene-editing capabilities. Despite longer optimization times inherent to SDS-based approaches, SceneWiz3D offers a practical, flexible framework for high-fidelity, view-consistent 3D scene generation from natural language prompts.
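
For readers unfamiliar with the SDS prior mentioned above, the gradient below is the standard score distillation form introduced in DreamFusion. It is shown only as background and is not necessarily the notation or the exact loss used in this paper. The symbols are assumptions for illustration: \theta stands for the parameters of the 3D representation (here, presumably the DMTets and the NeRF), g renders an image x from camera c, y is the text prompt, \hat{\epsilon}_\phi is the frozen diffusion model's noise prediction, and w(t) is a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta, c),
\qquad x_t = \alpha_t\, x + \sigma_t\, \epsilon,
\quad \epsilon \sim \mathcal{N}(0, I).
```

Intuitively, the rendered view is noised, the diffusion model predicts that noise conditioned on the prompt, and the residual between predicted and injected noise is pushed back through the renderer to update the 3D parameters.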

Abstract

We are witnessing significant breakthroughs in the technology for generating 3D objects from text. Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets. Generating entire scenes, however, remains very challenging, as a scene contains multiple 3D objects that are diverse and scattered. In this work, we introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text. We marry the locality of objects with the globality of scenes by introducing a hybrid 3D representation: explicit for objects and implicit for scenes. Remarkably, an object, being represented explicitly, can be either generated from text using conventional text-to-3D approaches or provided by users. To configure the layout of the scene and automatically place objects, we apply the Particle Swarm Optimization technique during the optimization process. Furthermore, it is difficult for certain parts of the scene (e.g., corners, occluded regions) to receive multi-view supervision, leading to inferior geometry. We incorporate an RGBD panorama diffusion model to mitigate this, resulting in high-quality geometry. Extensive evaluation supports that our approach achieves superior quality over previous approaches, enabling the generation of detailed and view-consistent 3D scenes.

Paper Structure

This paper contains 18 sections, 6 equations, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: Diverse 3D scenes synthesized by our method. Our method can incorporate objects that are either automatically generated from text prompts (top left, top right) or provided by a user (middle right). Our method generalizes to different scene types and styles and supports scene manipulation, such as moving or deleting objects (bottom).
  • Figure 2: SceneWiz3D Overview. To model 3D scenes, we adopt a hybrid representation containing explicit and implicit components: DMTets for objects of interest (OOIs) and NeRF for the environment. Given a text prompt, we first identify OOIs of the scene, and initialize their DMTets. We update the OOIs' configurations with Particle Swarm Optimization based on CLIP similarity, and update both OOIs and the environment by score distillation with a text-to-image diffusion model, a panoramic RGBD diffusion model, and a depth regularizer.
  • Figure 3: Discovering scene configurations in both 2D (top) and 3D (bottom). Naive gradient-based optimization gets trapped in local minima of the low-dimensional, non-convex optimization space, leading to improbable configurations (objects overlap or are incorrectly placed). Particle Swarm Optimization, in contrast, correctly identifies plausible configurations.
  • Figure 4: Qualitative comparisons with baselines. Prompt: a bedroom, realistic detailed photo.
  • Figure 5: Qualitative ablation study. Prompt: a washing room, realistic detailed photo.
  • ...and 6 more figures
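
The contrast drawn in the Figure 3 caption, between gradient-based search and Particle Swarm Optimization on a low-dimensional, non-convex layout space, can be illustrated with a minimal sketch. This is not the authors' implementation. The paper scores configurations by CLIP similarity of rendered views, whereas `toy_layout_cost` below is a hypothetical stand-in with made-up preferred positions, and all hyperparameters (`inertia`, `c_personal`, `c_global`, swarm size) are assumed textbook values.

```python
import numpy as np

# Assumed "preferred" 2D positions for two objects (purely illustrative).
TARGETS = np.array([[-1.0, 0.5], [1.0, -0.5]])


def toy_layout_cost(layout):
    """Hypothetical stand-in for a negative CLIP-similarity score.

    `layout` is a flat array [x0, y0, x1, y1] with the 2D positions of two objects.
    """
    p = np.asarray(layout, dtype=float).reshape(2, 2)
    placement = np.sum((p - TARGETS) ** 2)
    gap = np.linalg.norm(p[0] - p[1])
    overlap = max(0.0, 1.0 - gap) ** 2  # penalise objects closer than 1 unit
    # Cosine ripple: makes the landscape non-convex with many local minima.
    ripple = 0.3 * np.sum(1.0 - np.cos(4.0 * np.pi * (p - TARGETS)))
    return placement + 5.0 * overlap + ripple


def particle_swarm(cost, dim, n_particles=40, n_iters=200, bounds=(-2.0, 2.0),
                   inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Basic global-best PSO; returns (best_position, best_cost)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pos = rng.uniform(lo, hi, size=(n_particles, dim))
    vel = np.zeros_like(pos)

    # Each particle remembers its own best position; the swarm shares one global best.
    best_pos = pos.copy()
    best_cost = np.array([cost(p) for p in pos])
    g = int(np.argmin(best_cost))
    g_pos, g_cost = best_pos[g].copy(), float(best_cost[g])

    for _ in range(n_iters):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # Velocity mixes momentum, pull toward personal best, pull toward global best.
        vel = (inertia * vel
               + c_personal * r1 * (best_pos - pos)
               + c_global * r2 * (g_pos - pos))
        pos = np.clip(pos + vel, lo, hi)

        costs = np.array([cost(p) for p in pos])
        improved = costs < best_cost
        best_pos[improved] = pos[improved]
        best_cost[improved] = costs[improved]

        g = int(np.argmin(best_cost))
        if best_cost[g] < g_cost:
            g_pos, g_cost = best_pos[g].copy(), float(best_cost[g])

    return g_pos, g_cost


if __name__ == "__main__":
    layout, score = particle_swarm(toy_layout_cost, dim=4)
    print("best layout:", layout.reshape(2, 2))
    print("best cost:", score)
```

Because the search only evaluates the cost and never differentiates it, the same loop would work with a non-differentiable score such as a similarity computed on rendered images, which is one plausible reason a swarm-based search suits layout discovery.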