MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views

Yuedong Chen, Chuanxia Zheng, Haofei Xu, Bohan Zhuang, Andrea Vedaldi, Tat-Jen Cham, Jianfei Cai

TL;DR

MVSplat360 tackles 360° novel view synthesis from sparse inputs (as few as 5 views) by pairing a feed-forward 3D Gaussian Splatting (3DGS) backbone with refinement by a pre-trained Stable Video Diffusion (SVD) model. The 3DGS model reconstructs coarse but 3D-consistent geometry and renders features directly into the SVD latent space, where they condition the denoising process; the whole pipeline is trained end to end and needs no per-scene optimization. On a new benchmark built from DL3DV-10K, it produces better visual quality than state-of-the-art feed-forward methods, with additional validation on RealEstate10K. The outputs are multi-view consistent and contain plausible content in unobserved regions; the stated limitations are color realism, potential hallucinations, and computational cost.

Abstract

We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis (NVS) of diverse real-world scenes, using only sparse observations. This setting is inherently ill-posed due to minimal overlap among input views and insufficient visual information provided, making it challenging for conventional methods to achieve high-quality results. Our MVSplat360 addresses this by effectively combining geometry-aware 3D reconstruction with temporally consistent video generation. Specifically, it refactors a feed-forward 3D Gaussian Splatting (3DGS) model to render features directly into the latent space of a pre-trained Stable Video Diffusion (SVD) model, where these features then act as pose and visual cues to guide the denoising process and produce photorealistic 3D-consistent views. Our model is end-to-end trainable and supports rendering arbitrary views with as few as 5 sparse input views. To evaluate MVSplat360's performance, we introduce a new benchmark using the challenging DL3DV-10K dataset, where MVSplat360 achieves superior visual quality compared to state-of-the-art methods on wide-sweeping or even 360° NVS tasks. Experiments on the existing benchmark RealEstate10K also confirm the effectiveness of our model. The video results are available on our project page: https://donydchen.github.io/mvsplat360.
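
The core idea in the abstract, rendering Gaussian features into the latent space of the video diffusion model and using them as conditions, can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the isotropic 2D Gaussians, the 4-channel latent grid, the 8x downsampling factor, the per-frame pixel shift standing in for camera motion, and the channel-wise concatenation with the noisy latents are all assumptions made for illustration, and the function names `render_latent_features` and `build_denoiser_input` are hypothetical.

```python
import numpy as np


def render_latent_features(means_2d, depths, radii, opacities, feats, height, width):
    """Alpha-composite per-Gaussian feature vectors onto a (C, H, W) latent grid.

    Toy stand-in for a 3DGS rasterizer (assumption): Gaussians are isotropic
    and already projected to pixel coordinates of the latent grid.
    """
    num_channels = feats.shape[1]
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    out = np.zeros((num_channels, height, width))
    transmittance = np.ones((height, width))  # fraction of the ray not yet absorbed
    for i in np.argsort(depths):  # composite front to back
        dist_sq = (xs - means_2d[i, 0]) ** 2 + (ys - means_2d[i, 1]) ** 2
        alpha = opacities[i] * np.exp(-0.5 * dist_sq / radii[i] ** 2)
        weight = transmittance * alpha
        out += weight[None] * feats[i][:, None, None]
        transmittance = transmittance * (1.0 - alpha)
    return out, transmittance


def build_denoiser_input(noisy_latents, rendered_feats):
    """Concatenate conditions with noisy latents along the channel axis (assumed scheme)."""
    return np.concatenate([noisy_latents, rendered_feats], axis=1)


rng = np.random.default_rng(0)
num_frames, num_gaussians, latent_ch = 3, 50, 4
lat_h, lat_w = 32, 56  # e.g. a 256x448 frame at 8x downsampling (assumed sizes)

# Hypothetical scene: random Gaussians carrying latent-space features.
base_means = rng.uniform([0.0, 0.0], [lat_w, lat_h], size=(num_gaussians, 2))
depths = rng.uniform(1.0, 10.0, size=num_gaussians)
radii = rng.uniform(1.0, 4.0, size=num_gaussians)
opacities = rng.uniform(0.1, 0.9, size=num_gaussians)
feats = rng.normal(size=(num_gaussians, latent_ch))

# One feature map per target view; a horizontal shift mimics camera motion.
cond = np.stack([
    render_latent_features(base_means + np.array([2.0 * t, 0.0]),
                           depths, radii, opacities, feats, lat_h, lat_w)[0]
    for t in range(num_frames)
])

noisy_latents = rng.normal(size=(num_frames, latent_ch, lat_h, lat_w))
denoiser_in = build_denoiser_input(noisy_latents, cond)
print(cond.shape, denoiser_in.shape)
```

Because the features are rendered at latent resolution rather than image resolution, the conditions already match the spatial size of the tensor being denoised, which is what makes the channel-wise concatenation in this sketch possible without any resizing.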

Paper Structure

This paper contains 18 sections, 2 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Examples of our MVSplat360. Given sparse and wide-baseline observations of diverse in-the-wild scenes, MVSplat360 can directly render 360° novel views (inward or outward facing) or other natural camera trajectory views in a feed-forward manner, without any per-scene optimization.
  • Figure 2: Overview of our MVSplat360. (a) Given sparse posed images as input, we first match and fuse the multi-view information using a multi-view Transformer and cost volume-based encoder. (b) Next, a 3DGS representation is constructed to represent the coarse geometry of the entire scene. (c) Considering such coarse reconstruction is imperfect, we further adapt a pre-trained SVD, using features rendered from the 3DGS representation as conditions to achieve 360° novel view synthesis.
  • Figure 3: Qualitative comparisons on DL3DV-10K. MVSplat360 shows significant improvement compared to existing SoTA models. Here, we showcase scenes with a rich mix of diversity and complexity, including indoor (bounded) vs. outdoor (unbounded), high vs. low texture frequency, more vs. less reflection, and more vs. less transparency. More results are provided in the appendix.
  • Figure 4: Qualitative comparisons on RealEstate10K. MVSplat360 shows reasonable generations for disoccluded and unobserved regions, while latentSplat (Wewer et al., 2024) fills in content with artifacts.
  • Figure 5: SfM on input and rendered views. Images with red borders are the input views, while others are rendered by our MVSplat360. The reasonably recovered camera poses and 3D point clouds via VGGSfM imply that our outputs are multi-view consistent and geometrically correct.
  • ...and 4 more figures