3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation

Frank Zhang, Yibo Zhang, Quan Zheng, Rui Ma, Wei Hua, Hujun Bao, Weiwei Xu, Changqing Zou

TL;DR

This work employs a tri-plane feature-based NeRF as a unified representation of the 3D scene to constrain global 3D consistency, and proposes a generative refinement network that synthesizes higher-quality new content by exploiting the natural image prior of a 2D diffusion model together with the global 3D information of the current scene.

Abstract

Text-driven 3D scene generation techniques have made rapid progress in recent years. Their success is mainly attributed to using existing generative models to iteratively perform image warping and inpainting to generate 3D scenes. However, these methods rely heavily on the outputs of existing models, leading to error accumulation in geometry and appearance that prevents the models from being used in various scenarios (e.g., outdoor and unreal scenarios). To address this limitation, we generatively refine the newly generated local views by querying and aggregating global 3D information, and then progressively generate the 3D scene. Specifically, we employ a tri-plane feature-based NeRF as a unified representation of the 3D scene to constrain global 3D consistency, and propose a generative refinement network that synthesizes higher-quality new content by exploiting the natural image prior of a 2D diffusion model together with the global 3D information of the current scene. Our extensive experiments demonstrate that, in comparison to previous methods, our approach supports a wide variety of scene types and arbitrary camera trajectories with improved visual quality and 3D consistency.
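
To make the "tri-plane feature-based NeRF" idea more concrete, the following is a minimal sketch of how a generic tri-plane representation is usually queried: a 3D point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the three feature vectors are aggregated. The function names (`bilinear_sample`, `query_triplane`), the plane resolution, the channel count, and aggregation by summation are all assumptions made for illustration. The abstract does not specify these details, so this should not be read as the authors' implementation.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample an (R, R, C) feature plane at coords u, v in [-1, 1]."""
    res = plane.shape[0]
    # Map normalized coordinates to continuous pixel coordinates in [0, res - 1].
    x = np.clip((u + 1.0) * 0.5 * (res - 1), 0.0, res - 1)
    y = np.clip((v + 1.0) * 0.5 * (res - 1), 0.0, res - 1)
    # Top-left corner of the cell containing each sample.
    x0 = np.clip(np.floor(x).astype(int), 0, res - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, res - 2)
    wx = (x - x0)[:, None]  # horizontal interpolation weight
    wy = (y - y0)[:, None]  # vertical interpolation weight
    f00 = plane[y0, x0]
    f01 = plane[y0, x0 + 1]
    f10 = plane[y0 + 1, x0]
    f11 = plane[y0 + 1, x0 + 1]
    return (f00 * (1 - wx) * (1 - wy) + f01 * wx * (1 - wy)
            + f10 * (1 - wx) * wy + f11 * wx * wy)

def query_triplane(planes, points):
    """Return per-point features from three axis-aligned planes.

    planes: dict with keys 'xy', 'xz', 'yz', each an (R, R, C) array.
    points: (N, 3) array of positions inside the cube [-1, 1]^3.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    # Aggregation by summation is an assumption; concatenation is also common.
    return f_xy + f_xz + f_yz

# Hypothetical usage: 32x32 planes with 8 feature channels.
rng = np.random.default_rng(0)
planes = {k: rng.normal(size=(32, 32, 8)) for k in ("xy", "xz", "yz")}
points = rng.uniform(-1.0, 1.0, size=(5, 3))
features = query_triplane(planes, points)
print(features.shape)  # (5, 8)
```

In a NeRF-style pipeline, the aggregated feature would typically be passed to a small MLP that predicts density and color for volume rendering. Because every view queries the same three planes, the planes act as a single shared scene representation.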

Paper Structure (15 sections, 6 equations, 7 figures, 4 tables)

This paper contains 15 sections, 6 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: 3D scene generation from text prompts. (a) Given a scene description prompt and an arbitrary 6-degree-of-freedom (6-DOF) camera trajectory, our approach progressively generates the full 3D scene by continuously synthesizing 2D novel views. (b) The limitations of mesh representations (Text2Room, SceneScape) and the lack of a reasonable rectification mechanism lead to cumulative errors in outdoor scenes, marked with yellow and blue dashed boxes, respectively. In contrast, our approach alleviates the problem by introducing a progressive generation pipeline.
  • Figure 2: Comparison with existing designs. (a) Feed-forward approaches use depth-based warping and refinement operations to generate novel views of the scene without a unified representation. (b) Warping-inpainting approaches use a mesh as the unified representation and generate the scene through iterative inpainting. (c) We replace the mesh with a NeRF as the unified representation and alleviate the cumulative error issue by incorporating a generative refinement model. This allows our framework to support the generation of a wider range of scene types. The table at the bottom illustrates the unique features of the proposed approach. We use a tick with a cross on it for SceneScape because it supports only backward camera movement and therefore cannot provide full unbounded generation.
  • Figure 3: Overview of our pipeline. (a) Scene Context Initialization contains a supporting database that provides novel-viewpoint data for progressive generation. (b) Unified 3D Representation provides a unified representation of the generated scene, which allows our approach to handle more general scene generation while maintaining 3D consistency. (c) 3D-Aware Generative Refinement alleviates the cumulative error issue during long-term extrapolation by exploiting a large-scale natural image prior to generatively refine the synthesized novel-viewpoint image. The consistency regularization module is used for test-time optimization.
  • Figure 4: Qualitative Results. Our approach produces high-fidelity scenes with stable 3D consistency in indoor, outdoor, and unreal-style scenes. More high-resolution results can be found in the supplementary material.
  • Figure 5: Comparison with text-to-panorama methods. Although our method is not trained on panoramic data, it can still generate multiple views with cross-view consistency.
  • ...and 2 more figures