ExScene: Free-View 3D Scene Reconstruction with Gaussian Splatting from a Single Image

Tianyi Gong, Boyan Li, Yifei Zhong, Fangxin Wang

TL;DR

ExScene uses a novel multimodal diffusion model to generate a high-fidelity, globally consistent panoramic image, estimates panoramic depth to recover geometric information from that panorama, and combines the geometry with the panoramic image to train an initial 3D Gaussian Splatting model.

Abstract

The increasing demand for augmented and virtual reality applications has highlighted the importance of crafting immersive 3D scenes from a single-view image. However, because a single view provides only partial priors, existing methods are often limited to reconstructing low-consistency 3D scenes with narrow fields of view. These limitations make them less capable of generalizing to immersive scenes. To address this problem, we propose ExScene, a two-stage pipeline that reconstructs an immersive 3D scene from any given single-view image. In the first stage, ExScene uses a novel multimodal diffusion model to generate a high-fidelity and globally consistent panoramic image. We then develop a panoramic depth estimation approach to calculate geometric information from the panorama, and we combine this geometric information with the high-fidelity panoramic image to train an initial 3D Gaussian Splatting (3DGS) model. In the second stage, we introduce a GS refinement technique with 2D stable video diffusion priors. We add camera trajectory consistency and color-geometric priors to the diffusion denoising process to improve color and spatial consistency across image sequences. These refined sequences are then used to fine-tune the initial 3DGS model, leading to better reconstruction quality. Experimental results demonstrate that ExScene achieves consistent and immersive scene reconstruction using only single-view input, significantly surpassing state-of-the-art baselines.
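To make the step that combines panoramic depth with the panoramic image more concrete, here is a minimal sketch of how an equirectangular panorama and a per-pixel depth map could be back-projected into a colored point set, which is a common way to seed 3DGS training. This is not the authors' implementation: the function names, the axis convention (+z forward, +x right, +y up), and the assumption that depth is measured along the viewing ray are all illustrative choices.

```python
import math


def panorama_pixel_to_ray(u, v, width, height):
    """Unit ray direction for pixel (u, v) of an equirectangular panorama.

    Assumed convention (not from the paper): longitude runs from -pi to pi
    left to right, latitude from +pi/2 to -pi/2 top to bottom.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def backproject_panorama(depth, colors):
    """Turn an H x W depth map and an H x W color map into colored 3D points.

    depth[v][u] is assumed to be the distance along the viewing ray;
    colors[v][u] is any per-pixel color value (for example an RGB tuple).
    Pixels with non-positive depth are skipped as invalid.
    """
    height = len(depth)
    width = len(depth[0])
    points = []
    for v in range(height):
        for u in range(width):
            d = depth[v][u]
            if d <= 0:
                continue
            rx, ry, rz = panorama_pixel_to_ray(u, v, width, height)
            points.append(((d * rx, d * ry, d * rz), colors[v][u]))
    return points
```

Each returned point could serve as the initial center and color of one Gaussian. A real pipeline would also set scale, opacity, and orientation, and would operate on arrays rather than nested lists.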

Paper Structure

This paper contains 14 sections, 9 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: ExScene is a novel reconstruction method that extends and reconstructs any given single narrow-view image into an immersive 360-degree 3D scene based on 3D Gaussian Splatting (3DGS).
  • Figure 2: Overview of ExScene. Stage 1: Generate initialized Gaussian scenes. Given a single-view image as input, ExScene introduces a 2D multimodal diffusion model with panoramic priors to generate a high-quality and consistent panoramic image (Section \ref{subsec:A}). Subsequently, ExScene uses a depth estimation module with distortion-elimination regularization to calculate the geometric parameters of the generated panorama (Section \ref{subsec:B}). The depth map is then combined with the panoramic image to train an initial 3DGS (Section \ref{subsec:C}). Stage 2: Fine-tune initialized 3D Gaussians. To address the missing regions and geometric distortions present in the initial 3DGS, we design a Patching Module consisting of a Stable Video Diffusion (SVD) model that fuses camera trajectory consistency with image color-geometric priors. First, ExScene simulates virtual camera trajectories over the initial 3DGS to obtain rendered image sequences. These sequences are then refined by the Patching Module, yielding high-fidelity and consistent repaired views. Finally, we use the repaired sequences to fine-tune the initial 3DGS, improving the reconstruction quality (Section \ref{subsec:D}).
  • Figure 3: Qualitative comparison of our method with ViewExtrapolator [liu2024novel] and VistaDream [wang2024vistadream]. Given a single-view input image, (a) VistaDream first outpaints the image and reconstructs a Gaussian scaffold, then uses a consistent diffusion model to fine-tune the scaffold. (b) ViewExtrapolator employs the generative priors of an SVD model to achieve realistic novel view extrapolation. (c) ExScene (Ours) uses a two-stage framework for single-view reconstruction.
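
The Figure 2 caption mentions simulating virtual camera trajectories over the initial 3DGS to render image sequences. The source does not say what shape those trajectories take, so the sketch below only illustrates one simple possibility: cameras placed on a horizontal arc around the panorama center, each looking outward. The function name `arc_trajectory`, its parameters, and the axis convention (+z forward, +x right, +y up) are hypothetical.

```python
import math


def arc_trajectory(num_frames, radius, yaw_start_deg, yaw_end_deg):
    """Camera poses on a horizontal arc around the origin, looking outward.

    Illustrative only: the actual trajectory design is not given in the
    source. Each pose is a dict with a 'position' and a unit 'forward' vector.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    poses = []
    for i in range(num_frames):
        # Interpolate the yaw angle linearly between the two endpoints.
        t = i / (num_frames - 1)
        yaw = math.radians(yaw_start_deg + t * (yaw_end_deg - yaw_start_deg))
        forward = (math.sin(yaw), 0.0, math.cos(yaw))
        position = (radius * forward[0], 0.0, radius * forward[2])
        poses.append({"position": position, "forward": forward})
    return poses
```

Rendering the initial 3DGS from each pose in order would give one image sequence; a small `radius` keeps the cameras near the panorama's capture point, where the initial reconstruction is presumably most reliable.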