SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with Panoramic Gaussian Splatting

Wenrui Li, Fucheng Cai, Yapeng Mi, Zhe Yang, Wangmeng Zuo, Xingtao Wang, Xiaopeng Fan

TL;DR

This paper tackles the challenge of text-driven 3D scene generation by introducing SceneDreamer360, which creates a spatially consistent 3D scene from text prompts using a high-resolution panoramic prior and 3D Gaussian Splatting. It enhances PanFusion with an MLP and LoRA, reannotates data with complex long texts, and applies a three-stage panoramic super-resolution pipeline to produce 6K panoramas that guide accurate point-cloud reconstruction. The method delivers higher-quality, more spatially coherent 3D scenes compared with baselines, validated by both qualitative visuals and quantitative metrics. This approach advances scalable, domain-generalized text-to-3D generation with efficient rendering and detailed geometry.

Abstract

Text-driven 3D scene generation has seen significant advancements recently. However, most existing methods generate single-view images using generative models and then stitch them together in 3D space. Because each view is generated independently, the resulting 3D scenes are often spatially inconsistent and implausible. To address this challenge, we propose a novel text-driven 3D-consistent scene generation model: SceneDreamer360. Our method leverages a text-driven panoramic image generation model as a prior for 3D scene generation and employs 3D Gaussian Splatting (3DGS) to ensure consistency across multi-view panoramic images. Specifically, SceneDreamer360 enhances the fine-tuned PanFusion generator with a three-stage panoramic enhancement, enabling the generation of high-resolution, detail-rich panoramic images. During 3D scene construction, a novel point cloud fusion initialization method is used, producing higher-quality and spatially consistent point clouds. Our extensive experiments demonstrate that, compared to other methods, SceneDreamer360 with its panoramic image generation and 3DGS can produce higher-quality, spatially consistent, and visually appealing 3D scenes from any text prompt. Our code is available at https://github.com/liwrui/SceneDreamer360.

Paper Structure (27 sections, 15 equations, 5 figures, 2 tables, 1 algorithm)

This paper contains 27 sections, 15 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: We introduce SceneDreamer360, a text-based 3D scene generation framework designed to create realistic 3D scenes with high consistency across different viewpoints. SceneDreamer360 consists of two stages. In the first stage, the Panorama Generation module creates enhanced panoramic images to provide a complete and consistent spatial prior for the 3D scene. In the second stage, 3D Gaussian Splatting is used to reconstruct multi-view spatial images, resulting in a complete and spatially consistent point cloud.
  • Figure 2: The architecture of SceneDreamer360. SceneDreamer360 generates an initial panorama from any open-world textual description using the fine-tuned PanFusion model. This panorama is then enhanced to produce a high-resolution $3072 \times 6144$ image. In the second stage, multi-view algorithms generate multi-view images, and a monocular depth estimation model provides depth maps for initial point cloud fusion. Finally, 3D Gaussian Splatting is used to reconstruct and render the point cloud, resulting in a complete and consistent 3D scene.
  • Figure 3: Visual comparison between the proposed SceneDreamer360 and current methods. The images on the left show renderings from new viewpoints, demonstrating how each method handles various perspectives. The images on the right show spatial views of the generated 3D scenes, allowing a direct comparison of the spatial consistency and overall structure of the reconstructions.
  • Figure 4: The Panoramic Image on the left represents the high-resolution enhanced panorama generated from a specific text input, showcasing the detailed and cohesive scene captured in 360 degrees. The Multi-view Image in the center displays various perspective images derived from this panorama, illustrating the scene's adaptability across different viewpoints. Finally, the Output on the right presents images rendered from the generated point cloud at new viewpoints, highlighting SceneDreamer360's ability to maintain high quality and spatial consistency in the 3D reconstructions. As shown in the figure, SceneDreamer360 effectively produces detailed, visually coherent, and spatially consistent 3D scenes.
  • Figure 5: The ablation study of SceneDreamer360.
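
The second stage described in the Figure 2 caption (perspective views taken from the panorama, then per-view depth lifted into a point cloud) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `perspective_rays`, `sample_panorama`, and `backproject` are hypothetical, the panorama is assumed to be equirectangular, sampling is nearest-neighbour, and depth is treated as distance along each ray, which is a simplification of what a monocular depth estimator typically returns.

```python
import numpy as np

def perspective_rays(height, width, fov_deg, yaw_deg, pitch_deg):
    """Unit ray directions (H, W, 3) for a pinhole view rotated by yaw and pitch.

    Assumed convention: x right, y down, z forward.
    """
    focal = 0.5 * width / np.tan(np.radians(fov_deg) / 2.0)  # focal length in pixels
    xs = np.arange(width) - (width - 1) / 2.0
    ys = np.arange(height) - (height - 1) / 2.0
    x, y = np.meshgrid(xs, ys)
    dirs = np.stack([x, y, np.full_like(x, focal)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    rot_y = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                      [0.0, 1.0, 0.0],
                      [-np.sin(yaw), 0.0, np.cos(yaw)]])
    rot_x = np.array([[1.0, 0.0, 0.0],
                      [0.0, np.cos(pitch), -np.sin(pitch)],
                      [0.0, np.sin(pitch), np.cos(pitch)]])
    # Rotate every ray: pitch first, then yaw.
    return dirs @ (rot_y @ rot_x).T

def sample_panorama(pano, rays):
    """Nearest-neighbour lookup of equirectangular pixels along the given rays."""
    pano_h, pano_w = pano.shape[:2]
    lon = np.arctan2(rays[..., 0], rays[..., 2])        # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(-rays[..., 1], -1.0, 1.0))  # latitude, up is positive
    u = (lon / (2.0 * np.pi) + 0.5) * pano_w
    v = (0.5 - lat / np.pi) * pano_h
    cols = np.floor(u).astype(int) % pano_w              # wrap around horizontally
    rows = np.clip(np.floor(v).astype(int), 0, pano_h - 1)
    return pano[rows, cols]

def backproject(depth, rays):
    """Lift per-pixel ray distances (H, W) into an (N, 3) point cloud."""
    return (rays * depth[..., None]).reshape(-1, 3)

if __name__ == "__main__":
    # Small synthetic stand-in for a panorama; the paper's is 3072 x 6144.
    pano = np.random.rand(64, 128, 3)
    rays = perspective_rays(32, 32, fov_deg=90.0, yaw_deg=45.0, pitch_deg=0.0)
    view = sample_panorama(pano, rays)
    points = backproject(np.ones((32, 32)), rays)  # placeholder constant depth
    print(view.shape, points.shape)
```

Because every view shares the panorama's camera centre, point clouds lifted from different yaw and pitch angles land in a single coordinate frame without extra registration. That shared frame is one plausible reason a panoramic prior helps spatial consistency, though the paper's fusion initialization may differ in detail.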