DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting

Shijie Zhou, Zhiwen Fan, Dejia Xu, Haoran Chang, Pradyumna Chari, Tejas Bharadwaj, Suya You, Zhangyang Wang, Achuta Kadambi

TL;DR

DreamScene360 tackles unconstrained 360° text-to-3D scene generation in two stages. It first creates a globally coherent 360° panorama with a 2D diffusion model and a prompt self-refinement loop, then lifts the panorama into 3D with panoramic 3D Gaussian Splatting (3DGS), initialized from monocular depth and refined by a learnable geometric field. The method enforces semantic and geometric consistency across synthesized and input views by combining DINOv2-based semantic similarity, DPT-based depth regularization, and parallax augmentation with virtual cameras. An end-to-end optimization combines RGB, semantic, and geometric losses to fine-tune the Gaussian parameters, yielding globally coherent 360° content with plausible novel-view renderings. Compared to a LucidDreamer baseline, DreamScene360 achieves better global consistency and full 360° coverage, although it is limited to a 512×1024 panorama resolution; future work targets higher resolution and dynamic 4D content.

Abstract

The increasing demand for virtual reality applications has highlighted the significance of crafting immersive 3D assets. We present a text-to-3D 360$^{\circ}$ scene generation pipeline that facilitates the creation of comprehensive 360$^{\circ}$ scenes for in-the-wild environments in a matter of minutes. Our approach utilizes the generative power of a 2D diffusion model and prompt self-refinement to create a high-quality and globally coherent panoramic image. This image acts as a preliminary "flat" (2D) scene representation. Subsequently, it is lifted into 3D Gaussians, employing splatting techniques to enable real-time exploration. To produce consistent 3D geometry, our pipeline constructs a spatially coherent structure by aligning the 2D monocular depth into a globally optimized point cloud. This point cloud serves as the initial state for the centroids of 3D Gaussians. To address the unseen regions inherent in single-view inputs, we impose semantic and geometric constraints on both synthesized and input camera views as regularizations. These guide the optimization of Gaussians, aiding in the reconstruction of unseen regions. In summary, our method offers a globally consistent 3D scene within a 360$^{\circ}$ perspective, providing an enhanced immersive experience over existing techniques. Project website at: http://dreamscene360.github.io/
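
The lifting step described above, where a depth map over the panorama becomes a point cloud that initializes the Gaussian centroids, can be illustrated with a minimal sketch. It assumes the depth has already been aligned into one globally consistent map, and it does not reproduce the paper's depth alignment or optimization. The function name `panorama_to_points` and the axis convention (y up, z forward at the image center) are assumptions made for illustration only.

```python
import math

def panorama_to_points(depth):
    """Unproject an equirectangular depth map (list of rows) into 3D points.

    Assumed convention (not taken from the paper): columns span longitude
    from -pi to pi, rows span latitude from +pi/2 (top) to -pi/2 (bottom),
    and each pixel is sampled at its center.
    """
    height = len(depth)
    width = len(depth[0])
    points = []
    for row in range(height):
        # Latitude of the pixel center: +pi/2 at the top, -pi/2 at the bottom.
        lat = math.pi / 2 - (row + 0.5) / height * math.pi
        for col in range(width):
            # Longitude of the pixel center: -pi at the left, +pi at the right.
            lon = (col + 0.5) / width * 2 * math.pi - math.pi
            d = depth[row][col]
            # Spherical-to-Cartesian conversion, scaled by the depth value.
            x = d * math.cos(lat) * math.sin(lon)
            y = d * math.sin(lat)
            z = d * math.cos(lat) * math.cos(lon)
            points.append((x, y, z))
    return points
```

With a constant depth map every point lies on a sphere of that radius, which is a quick sanity check on the convention before using the points as initial centroids.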

Paper Structure (32 sections, 7 equations, 7 figures, 1 table)


Figures (7)

  • Figure 1: DreamScene360. We introduce a 3D scene generation pipeline that creates immersive scenes with full 360$^{\circ}$ coverage from text prompts of any level of specificity.
  • Figure 2: Overall Architecture. Beginning with a concise text prompt, we employ a diffusion model to generate a 360$^\circ$ panoramic image. A self-refinement process is employed to produce the optimal 2D candidate panorama. Subsequently, a 3D geometric field is utilized to initialize the Panoramic 3D Gaussians. Throughout this process, both semantic and geometric correspondences are employed as guiding principles for the optimization of the Gaussians, aiming to address and fill the gaps resulting from the single-view input.
  • Figure 3: Diverse Generation. We demonstrate that our generated 3D scenes are diverse in style, consistent in geometry, and highly matched with the simple text inputs.
  • Figure 4: Visual Comparisons. We showcase 360$^\circ$ 3D scene generation. Each row displays, from left to right, novel views as the camera undergoes clockwise rotation in yaw, accompanied by slight random rotations in pitch and random translations. LucidDreamer [chung2023luciddreamer] hallucinates novel views from a conditioned image (indicated by a red bounding box) but lacks global semantic, stylistic, and geometric consistency. In contrast, our method provides complete 360$^\circ$ coverage without any blind spots (black areas in baseline results), and shows globally consistent semantics.
  • Figure 5: Ablation of Self-Refinement. We demonstrate that the self-refinement process greatly enhances the image quality by improving the text prompt. In each row, the image on the left is generated from a simple user prompt, while the image on the right is generated from a prompt augmented by GPT-4V. After the multi-round self-refinement, GPT-4V selects the panorama with the best visual quality, which provides a solid basis for the immersive 3D scene generated next.
  • ...and 2 more figures
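
The multi-round self-refinement described in the Figure 2 and Figure 5 captions can be sketched as a simple loop: generate a panorama, revise the prompt, repeat, and keep the best candidate. This is a hedged illustration rather than the authors' implementation. The callables `generate`, `revise`, and `score` are hypothetical stand-ins for the diffusion model and the GPT-4V prompt revision and quality assessment, and the numeric score is an assumption, since the source only says that GPT-4V selects the better panorama.

```python
from typing import Callable, List, Tuple

def self_refine(
    prompt: str,
    generate: Callable[[str], str],     # hypothetical: prompt -> panorama handle
    revise: Callable[[str, str], str],  # hypothetical: (prompt, panorama) -> new prompt
    score: Callable[[str], float],      # hypothetical: panorama -> quality score
    rounds: int = 3,
) -> Tuple[str, str]:
    """Return the (prompt, panorama) pair with the highest score."""
    candidates: List[Tuple[float, str, str]] = []
    for _ in range(rounds):
        panorama = generate(prompt)
        candidates.append((score(panorama), prompt, panorama))
        # Use the current result to produce an improved prompt for the next round.
        prompt = revise(prompt, panorama)
    # Pick the best-scoring candidate; on ties, max keeps the earliest one.
    best = max(candidates, key=lambda c: c[0])
    return best[1], best[2]
```

Passing the models in as callables keeps the loop testable with stubs and avoids assuming any particular API for the generator or the vision-language model.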