Text2Immersion: Generative Immersive Scene with 3D Gaussians
Hao Ouyang, Kathryn Heal, Stephen Lombardi, Tiancheng Sun
TL;DR
Text2Immersion introduces a two-stage text-to-3D pipeline that uses 3D Gaussians to generate immersive, photorealistic scenes from text prompts. The first stage initializes a coarse Gaussian cloud from anchor views using 2D diffusion and depth estimation, while the second stage refines the scene by adding views and applying inpainting and super-resolution via split-and-clone operations. The approach achieves high rendering fidelity, 3D consistency, and real-time performance, and it outperforms baselines in quality and text alignment. It handles diverse outdoor, stylized, and imaginary scenes, including 360-degree VR outputs, and is supported by ablation studies. Its limitations are unreliable depth during initialization and ghosting during refinement, which point to stronger 2D supervision and more robust depth estimation as future directions.
Abstract
We introduce Text2Immersion, an elegant method for producing high-quality 3D immersive scenes from text prompts. Our pipeline begins by progressively generating a Gaussian cloud using pre-trained 2D diffusion and depth estimation models. A refinement stage then interpolates and refines the Gaussian cloud to enhance the details of the generated scene. Unlike prevalent methods that focus on single objects or indoor scenes, or that employ zoom-out trajectories, our approach generates diverse scenes containing various objects and even extends to imaginary scenes. Text2Immersion therefore has wide-ranging implications for applications such as virtual reality, game development, and automated content creation. Extensive evaluations demonstrate that our system surpasses other methods in rendering quality and diversity, advancing text-driven 3D scene generation. We will make the source code publicly accessible at the project page.
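As an illustration of how a generated image and its estimated depth map might seed a Gaussian cloud, the sketch below back-projects an RGB-D image into colored 3D points that could serve as initial Gaussian centers. It is not the authors' implementation: the function name unproject_depth, the pinhole intrinsics (fx, fy, cx, cy), and the optional cam_to_world matrix are all assumptions made for this example, and the abstract does not specify the actual initialization procedure.

```python
import numpy as np

def unproject_depth(depth, rgb, fx, fy, cx, cy, cam_to_world=None):
    """Back-project an RGB-D image into 3D points with per-point colors.

    Hypothetical helper: assumes a pinhole camera and a metric depth map.
    depth: (H, W) array, rgb: (H, W, 3) array, cam_to_world: optional (4, 4).
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Invert the pinhole projection u = fx * x / z + cx, v = fy * y / z + cy.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    # Drop pixels with missing or non-positive depth (checked in camera space).
    valid = np.isfinite(points).all(axis=1) & (points[:, 2] > 0)
    points, colors = points[valid], colors[valid]
    if cam_to_world is not None:
        # Rotate and translate camera-space points into the world frame.
        points = points @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    return points, colors
```

Under these assumptions, each returned point would become the mean of one Gaussian and its color the initial appearance; repeating the step for further views, each with its own cam_to_world pose, would grow the cloud progressively.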
