Layout2Scene: 3D Semantic Layout Guided Scene Generation via Geometry and Appearance Diffusion Priors

Minglin Chen, Longguang Wang, Sheng Ao, Ye Zhang, Kai Xu, Yulan Guo

TL;DR

Layout2Scene is a text-to-scene generation framework guided by a 3D semantic layout. It decouples objects from backgrounds via a hybrid scene representation and refines geometry and appearance in two stages using diffusion priors. The scene is initialized with a pre-trained text-to-3D model, and the method then applies layout-aware camera sampling, semantic-guided geometry diffusion, and semantic-geometry guided appearance diffusion, producing scenes that are more plausible and more editable than those of prior approaches. Training on SunRGBD and evaluation with CS and IS demonstrate improved fidelity and realism, with training completing in about 1.5 hours and rendering running at about 30 FPS. The approach enables precise control over object locations and supports downstream editing and applications in complex 3D scene synthesis.

Abstract

3D scene generation conditioned on text prompts has significantly progressed due to the development of 2D diffusion generation models. However, the textual description of 3D scenes is inherently inaccurate and lacks fine-grained control during training, leading to implausible scene generation. As an intuitive and feasible solution, the 3D layout allows for precise specification of object locations within the scene. To this end, we present a text-to-scene generation method (namely, Layout2Scene) using additional semantic layout as the prompt to inject precise control of 3D object positions. Specifically, we first introduce a scene hybrid representation to decouple objects and backgrounds, which is initialized via a pre-trained text-to-3D model. Then, we propose a two-stage scheme to optimize the geometry and appearance of the initialized scene separately. To fully leverage 2D diffusion priors in geometry and appearance generation, we introduce a semantic-guided geometry diffusion model and a semantic-geometry guided diffusion model which are finetuned on a scene dataset. Extensive experiments demonstrate that our method can generate more plausible and realistic scenes as compared to state-of-the-art approaches. Furthermore, the generated scene allows for flexible yet precise editing, thereby facilitating multiple downstream applications.

Paper Structure (28 sections, 13 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Layout2Scene is a 3D semantic layout guided text-to-scene generative model that can create high-fidelity geometry and appearance for complex 3D scenes while adhering to user-provided object arrangement constraints. (a) The inputs are a 3D semantic layout and a text prompt of the scene. The 3D semantic layout is a collection of semantic bounding boxes, while the text prompt is a brief description (“a living room” in this case). (b) The generated scene exhibits a highly realistic appearance and high-quality geometry, displayed through RGB, normal, and depth renderings along a navigation trajectory. Furthermore, the proposed method trains in 1.5 hours and renders at 30 FPS on an NVIDIA V100 GPU.
  • Figure 2: Overview of Layout2Scene. The proposed method takes a 3D semantic layout and a prompt as input. First, we model the scene using a hybrid representation, which is initialized via a pre-trained text-to-3D model. Layout-aware camera sampling ensures that the sampled images cover the whole scene. Then, we employ a two-stage scheme to refine the geometry and appearance of the initialized scene via diffusion priors. In stage 1, we employ a semantic-guided geometry diffusion model to refine the normal and depth of the scene. In stage 2, we generate the appearance of the scene via a semantic-geometry guided diffusion model.
  • Figure 3: Qualitative comparisons of various scene generation approaches.
  • Figure 4: Results of various scene types produced by the proposed method. 1st row: living room. 2nd row: bathroom. For each scene, we show the bird's-eye view of the 3D semantic layout on the left, and the rendered RGB, normal, and depth maps on the right.
  • Figure 5: Visualization of normal and depth maps with and without the geometry diffusion prior.
  • ...and 1 more figures
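
The captions for Figures 1 and 2 describe the input as a collection of semantic bounding boxes plus a short text prompt, and mention a layout-aware camera sampling step. The following Python sketch illustrates one way such an input could be represented, together with a deliberately naive sampler. All names here (`SemanticBox`, `SceneLayout`, `sample_cameras`) and the sampling strategy itself are assumptions made for illustration; they are not taken from the paper, whose actual sampling scheme is not described in this summary.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticBox:
    """Hypothetical axis-aligned semantic bounding box (metres)."""
    label: str
    center: tuple[float, float, float]
    size: tuple[float, float, float]

    def contains(self, point: tuple[float, float, float]) -> bool:
        # A point is inside if it lies within half the extent on every axis.
        return all(abs(point[i] - self.center[i]) <= self.size[i] / 2
                   for i in range(3))


@dataclass
class SceneLayout:
    """Hypothetical scene input: a text prompt plus semantic boxes.

    The room is assumed to be centred at the origin in x and y,
    with the floor at z = 0.
    """
    prompt: str
    room_size: tuple[float, float, float]
    boxes: list[SemanticBox]


def sample_cameras(layout, n, height=1.5, seed=0, max_tries=1000):
    """Naive sampler (an assumption, not the paper's method).

    Draws camera positions inside the room, rejects positions that fall
    inside an object box, and aims each camera at a random object centre.
    Returns a list of (position, look_at_target) pairs.
    """
    if not layout.boxes:
        raise ValueError("layout has no boxes to look at")
    rng = random.Random(seed)
    half_x, half_y = layout.room_size[0] / 2, layout.room_size[1] / 2
    cameras = []
    tries = 0
    while len(cameras) < n and tries < max_tries:
        tries += 1
        position = (rng.uniform(-half_x, half_x),
                    rng.uniform(-half_y, half_y),
                    height)
        if any(box.contains(position) for box in layout.boxes):
            continue  # camera would sit inside an object
        target = rng.choice(layout.boxes).center
        cameras.append((position, target))
    return cameras


# Usage: a small living-room layout with three labelled boxes.
layout = SceneLayout(
    prompt="a living room",
    room_size=(4.0, 5.0, 2.8),
    boxes=[
        SemanticBox("sofa", (0.0, -2.0, 0.4), (2.0, 0.9, 0.8)),
        SemanticBox("table", (0.0, 0.0, 0.25), (1.0, 0.6, 0.5)),
        SemanticBox("wardrobe", (1.6, 2.0, 1.0), (0.8, 0.6, 2.0)),
    ],
)
cams = sample_cameras(layout, n=8)
```

Keeping the layout as plain data means an object can be moved by editing one box, which mirrors the kind of layout-level editing the abstract mentions. A real sampler would also need to check that the sampled views jointly cover the whole scene, which this sketch does not attempt.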