MVSplat360: Feed-Forward 360 Scene Synthesis from Sparse Views

Yuedong Chen; Chuanxia Zheng; Haofei Xu; Bohan Zhuang; Andrea Vedaldi; Tat-Jen Cham; Jianfei Cai

TL;DR

MVSplat360 tackles 360° novel view synthesis from sparse inputs by combining a geometry-centric, feed-forward 3D Gaussian Splatting backbone with refinement in the latent space of a pre-trained Stable Video Diffusion model. The backbone renders 3D-consistent coarse geometry, and the rendered latent Gaussian features then condition diffusion-based appearance refinement; the pipeline trains end-to-end and needs no per-scene optimization. On a new benchmark built on DL3DV-10K, MVSplat360 delivers better visual quality than state-of-the-art feed-forward methods, with additional validation on RealEstate10K. Its outputs are multi-view consistent and contain plausible content in unobserved regions. Acknowledged limitations include color realism, potential hallucinations, and computational cost.

Abstract

We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis (NVS) of diverse real-world scenes, using only sparse observations. This setting is inherently ill-posed due to minimal overlap among input views and insufficient visual information provided, making it challenging for conventional methods to achieve high-quality results. Our MVSplat360 addresses this by effectively combining geometry-aware 3D reconstruction with temporally consistent video generation. Specifically, it refactors a feed-forward 3D Gaussian Splatting (3DGS) model to render features directly into the latent space of a pre-trained Stable Video Diffusion (SVD) model, where these features then act as pose and visual cues to guide the denoising process and produce photorealistic 3D-consistent views. Our model is end-to-end trainable and supports rendering arbitrary views with as few as 5 sparse input views. To evaluate MVSplat360's performance, we introduce a new benchmark using the challenging DL3DV-10K dataset, where MVSplat360 achieves superior visual quality compared to state-of-the-art methods on wide-sweeping or even 360° NVS tasks. Experiments on the existing benchmark RealEstate10K also confirm the effectiveness of our model. The video results are available on our project page: https://donydchen.github.io/mvsplat360.
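To make "rendering features" concrete, the sketch below applies the standard front-to-back alpha compositing used in 3D Gaussian Splatting to feature vectors instead of RGB colors, for a single pixel. It is an illustration under stated assumptions, not the authors' implementation: the function name `composite_features`, the 4-channel feature size, and the input values are all hypothetical, and the real model rasterizes full feature maps on the GPU before passing them to the diffusion model.

```python
from typing import List, Sequence


def composite_features(
    features: Sequence[Sequence[float]],
    alphas: Sequence[float],
) -> List[float]:
    """Front-to-back alpha compositing of feature vectors for one pixel.

    features: one feature vector per Gaussian, sorted near-to-far.
    alphas:   opacity of each Gaussian at this pixel, each in [0, 1].
    (Hypothetical helper for illustration only.)
    """
    if len(features) != len(alphas):
        raise ValueError("features and alphas must have the same length")
    dim = len(features[0]) if features else 0
    out = [0.0] * dim
    transmittance = 1.0  # fraction of light not yet absorbed
    for feat, alpha in zip(features, alphas):
        weight = transmittance * alpha
        for c in range(dim):
            out[c] += weight * feat[c]
        transmittance *= 1.0 - alpha
    return out


# Two Gaussians along one ray, each carrying an assumed 4-channel feature.
pixel_feature = composite_features(
    features=[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    alphas=[0.5, 0.5],
)
print(pixel_feature)  # [0.5, 0.25, 0.0, 0.0]
```

The nearer Gaussian contributes half of its feature, and the farther one contributes only a quarter because half of the transmittance has already been absorbed. Features composited this way inherit the 3D consistency of the underlying Gaussians, which is what lets them serve as pose and visual cues for the denoising process.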
