Text2Immersion: Generative Immersive Scene with 3D Gaussians

Hao Ouyang, Kathryn Heal, Stephen Lombardi, Tiancheng Sun

TL;DR

Text2Immersion introduces a two-stage Text-to-3D pipeline that uses 3D Gaussians to generate immersive, photorealistic scenes from text prompts. The first stage initializes a coarse Gaussian cloud from anchor-view diffusion and depth, while the second stage refines the scene by adding views and applying inpainting and super-resolution via split-and-clone operations. The approach achieves high rendering fidelity, 3D consistency, and real-time performance, outperforming baselines in quality and text alignment. It enables diverse outdoor, stylized, and imaginary scenes, including 360-degree VR outputs, with robust ablations and demonstrated versatility. Limitations relate to depth reliability in initialization and ghosting during refinement, pointing to avenues for stronger 2D supervision and depth robustness.

Abstract

We introduce Text2Immersion, an elegant method for producing high-quality 3D immersive scenes from text prompts. Our pipeline begins by progressively generating a Gaussian cloud using pre-trained 2D diffusion and depth estimation models. A refinement stage then interpolates and refines the Gaussian cloud to enhance the details of the generated scene. Unlike prevalent methods that focus on single objects or indoor scenes, or that employ zoom-out trajectories, our approach generates diverse scenes with various objects, and even extends to imaginary scenes. Consequently, Text2Immersion can have wide-ranging implications for applications such as virtual reality, game development, and automated content creation. Extensive evaluations demonstrate that our system surpasses other methods in rendering quality and diversity, marking further progress toward text-driven 3D scene generation. We will make the source code publicly accessible at the project page.
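
To make the initialization idea concrete, the following is a minimal sketch of how per-view depth maps could be lifted into a shared 3D point set that seeds Gaussian centers. It is not the authors' implementation: the pinhole camera model, the centered principal point, the yaw-only rotation of the anchor cameras, and the names `yaw_rotation`, `unproject`, and `init_gaussian_centers` are all assumptions made for illustration.

```python
import math


def yaw_rotation(deg):
    """Camera-to-world rotation for a camera turned `deg` degrees about the vertical (y) axis."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]


def unproject(depth, focal, yaw_deg):
    """Lift a depth map (list of rows) to world-space points.

    Assumes a pinhole camera at the origin looking along +z, with the
    principal point at the image center and `focal` given in pixels.
    """
    h, w = len(depth), len(depth[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    rot = yaw_rotation(yaw_deg)
    points = []
    for v in range(h):
        for u in range(w):
            z = depth[v][u]
            # Back-project pixel (u, v) to camera coordinates.
            cam = ((u - cx) * z / focal, (v - cy) * z / focal, z)
            # Rotate into the shared world frame.
            world = tuple(sum(rot[i][j] * cam[j] for j in range(3)) for i in range(3))
            points.append(world)
    return points


def init_gaussian_centers(depth_maps, focal, yaws_deg):
    """Merge the points from several rotated views into one list of candidate centers."""
    centers = []
    for depth, yaw in zip(depth_maps, yaws_deg):
        centers.extend(unproject(depth, focal, yaw))
    return centers
```

In this sketch every pixel becomes one candidate center. A real system would also assign color, scale, and opacity to each Gaussian and would need to reconcile depth scales across views, which monocular depth estimators do not guarantee.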

Paper Structure (21 sections, 9 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: The overview of our pipeline. Our generation pipeline consists of two stages. In the first stage, we rotate the camera from the central view, and use diffusion models and monocular depth prediction modules to initialize a coarse Gaussian cloud. In the second stage, we sample more cameras around the center, and use diffusion-based inpainting modules to further refine the Gaussian cloud.
  • Figure 2: Qualitative comparison of generation quality with baselines including DreamFusion [poole2022dreamfusion], DreamGaussian [tang2023dreamgaussian], Text2Room [hoellein2023text2room], and Text2NeRF [zhang2023text2nerf]. We highly recommend that readers view the accompanying videos for a more thorough comparison.
  • Figure 3: Diverse Output Generation: Our pipeline is capable of synthesizing a variety of 3D scenes using the same prompts. We also demonstrate its ability to generate stylized scenes.
  • Figure 4: Ablation study on the refinement stage, which shows that the refinement helps fill in the missing regions and reduce the noisy appearance. After the refinement, the centers of the Gaussians effectively cover the originally missing parts.
  • Figure 5: Ablation study on the initialization. Without the anchor cameras, the generated images contain obvious gaps.
  • ...and 6 more figures