FastScene: Text-Driven Fast 3D Indoor Scene Generation via Panoramic Gaussian Splatting

Yikun Ma, Dandan Zhan, Zhi Jin

TL;DR

FastScene tackles the challenge of fast, coherent text-to-3D indoor scene generation by leveraging a panorama-first pipeline that couples diffusion-based panorama generation with depth estimation, coarse view synthesis, progressive cubemap inpainting, and panoramic-to-perspective projection before reconstruction with 3D Gaussian Splatting. Key innovations include CVS for hole-filled panorama synthesis, PNVI for controllable progressive inpainting, and MVP to bridge panoramas with COLMAP-compatible views for efficient 3DGS reconstruction. The approach achieves superior generation speed and scene consistency compared to state-of-the-art methods, producing a complete 3D scene in about 15 minutes from a text prompt. Extensive experiments on indoor datasets and panoramic extensions demonstrate robust quality, adaptability to existing panoramas, and strong ablations validating the effectiveness of the proposed components.

Abstract

Text-driven 3D indoor scene generation holds broad applications, ranging from gaming and smart homes to AR/VR applications. Fast and high-fidelity scene generation is paramount for ensuring user-friendly experiences. However, existing methods are characterized by lengthy generation processes or necessitate the intricate manual specification of motion parameters, which introduces inconvenience for users. Furthermore, these methods often rely on narrow-field viewpoint iterative generations, compromising global consistency and overall scene quality. To address these issues, we propose FastScene, a framework for fast and higher-quality 3D scene generation, while maintaining the scene consistency. Specifically, given a text prompt, we generate a panorama and estimate its depth, since the panorama encompasses information about the entire scene and exhibits explicit geometric constraints. To obtain high-quality novel views, we introduce the Coarse View Synthesis (CVS) and Progressive Novel View Inpainting (PNVI) strategies, ensuring both scene consistency and view quality. Subsequently, we utilize Multi-View Projection (MVP) to form perspective views, and apply 3D Gaussian Splatting (3DGS) for scene reconstruction. Comprehensive experiments demonstrate FastScene surpasses other methods in both generation speed and quality with better scene consistency. Notably, guided only by a text prompt, FastScene can generate a 3D scene within a mere 15 minutes, which is at least one hour faster than state-of-the-art methods, making it a paradigm for user-friendly scene generation.
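The abstract notes that a panorama with estimated depth carries explicit geometric constraints, which is what makes novel panoramic views obtainable by re-projection. The following is a minimal sketch of that idea, not the authors' implementation: each equirectangular pixel is lifted to a 3D point using its depth, the camera is translated, and the point is splatted back into a new panorama. The function names, the axis convention (y up, z forward, longitude measured from +z toward +x), and the nearest-pixel splatting with a depth buffer are all assumptions made for illustration. Pixels that receive no point stay `None`; these are the holes that a later inpainting stage would have to fill.

```python
import math

def pixel_to_angles(u, v, width, height):
    """Equirectangular pixel centre -> (longitude, latitude) in radians."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi
    return lon, lat

def angles_to_pixel(lon, lat, width, height):
    """Inverse of pixel_to_angles; returns fractional pixel coordinates."""
    u = (lon + math.pi) / (2.0 * math.pi) * width - 0.5
    v = (math.pi / 2.0 - lat) / math.pi * height - 0.5
    return u, v

def angles_to_point(lon, lat, depth):
    """Spherical direction plus depth -> 3D point (assumed: y up, z forward)."""
    return (depth * math.cos(lat) * math.sin(lon),
            depth * math.sin(lat),
            depth * math.cos(lat) * math.cos(lon))

def point_to_angles(x, y, z):
    """3D point -> (longitude, latitude, distance from the camera centre)."""
    depth = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)
    lat = math.atan2(y, math.hypot(x, z))
    return lon, lat, depth

def warp_panorama(colors, depths, translation):
    """Forward-warp a panorama to a camera moved by `translation`.

    Illustrative only: nearest-pixel splatting, closest point wins,
    target pixels that no point reaches remain None (holes).
    """
    height, width = len(colors), len(colors[0])
    tx, ty, tz = translation
    warped = [[None] * width for _ in range(height)]
    nearest = [[math.inf] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            lon, lat = pixel_to_angles(u, v, width, height)
            x, y, z = angles_to_point(lon, lat, depths[v][u])
            # Express the point relative to the new camera centre.
            lon_n, lat_n, depth_n = point_to_angles(x - tx, y - ty, z - tz)
            if depth_n == 0.0:
                continue  # the point coincides with the new camera centre
            u_n, v_n = angles_to_pixel(lon_n, lat_n, width, height)
            ui = int(round(u_n)) % width                    # wrap horizontally
            vi = min(max(int(round(v_n)), 0), height - 1)   # clamp vertically
            if depth_n < nearest[vi][ui]:
                nearest[vi][ui] = depth_n
                warped[vi][ui] = colors[v][u]
    return warped
```

With a zero translation the warp reproduces the input panorama; with a non-zero translation, regions the camera moves toward are stretched and may contain unfilled pixels.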

Paper Structure

This paper contains 17 sections, 9 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: The framework of our FastScene. Given the text prompt, we first generate a panorama and estimate its depth. Then, we iteratively generate multi-view panoramas through PNVI. We introduce MVP for perspective projection and use 3DGS for scene reconstruction.
  • Figure 2: Given a new camera pose $P_n$, the calculation for movement in spherical coordinates.
  • Figure 3: Illustration of progressive inpainting and movement.
  • Figure 4: Visual comparison of the original COLMAP output and our projection for panoramic input. Our method obtains accurate point clouds and camera poses.
  • Figure 5: Qualitative comparisons with other methods. For each method, we show the rendered views for the 1st and 5th frames. Our method generates high-quality scenes from the same text prompts while maintaining scene consistency.
  • ...and 3 more figures
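
The captions for Figures 1 and 4 refer to projecting panoramas into perspective views before reconstruction. As a rough illustration of what such a projection involves, the sketch below samples a square pinhole view from an equirectangular panorama at a chosen yaw angle. It is a generic construction and not the paper's MVP procedure: the function names, the 90-degree default field of view, nearest-neighbour sampling, and the axis convention (y up, z forward) are assumptions.

```python
import math

def perspective_ray(px, py, size, fov_deg):
    """Camera-space ray through pixel centre (px, py) of a square pinhole view."""
    focal = 0.5 * size / math.tan(math.radians(fov_deg) / 2.0)
    x = (px + 0.5) - size / 2.0
    y = size / 2.0 - (py + 0.5)  # image rows grow downward, world y grows upward
    return x, y, focal

def rotate_yaw(ray, yaw_deg):
    """Rotate a ray about the vertical axis; positive yaw turns from +z toward +x."""
    x, y, z = ray
    a = math.radians(yaw_deg)
    return (x * math.cos(a) + z * math.sin(a),
            y,
            -x * math.sin(a) + z * math.cos(a))

def ray_to_panorama_pixel(ray, width, height):
    """Ray direction -> fractional equirectangular pixel coordinates."""
    x, y, z = ray
    lon = math.atan2(x, z)
    lat = math.atan2(y, math.hypot(x, z))
    u = (lon + math.pi) / (2.0 * math.pi) * width - 0.5
    v = (math.pi / 2.0 - lat) / math.pi * height - 0.5
    return u, v

def sample_perspective_view(panorama, yaw_deg, size=4, fov_deg=90.0):
    """Nearest-neighbour sampling of a size x size perspective view."""
    height, width = len(panorama), len(panorama[0])
    view = []
    for py in range(size):
        row = []
        for px in range(size):
            ray = rotate_yaw(perspective_ray(px, py, size, fov_deg), yaw_deg)
            u, v = ray_to_panorama_pixel(ray, width, height)
            ui = int(round(u)) % width                    # wrap horizontally
            vi = min(max(int(round(v)), 0), height - 1)   # clamp vertically
            row.append(panorama[vi][ui])
        view.append(row)
    return view
```

Calling `sample_perspective_view` at several yaw angles (for example 0, 90, 180, and 270 degrees) yields a set of views with known relative rotations, which is the kind of input a perspective-camera reconstruction pipeline expects.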