Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs

Yingji Zhong; Zhihao Li; Dave Zhenyu Chen; Lanqing Hong; Dan Xu

TL;DR

To address extrapolation and occlusion in sparse-input 3D Gaussian Splatting (3DGS), this paper proposes a reconstruction-by-generation framework that leverages video diffusion priors. It introduces a training-free scene-grounding guidance that uses sequences rendered from an optimized 3DGS to steer the diffusion model toward consistent, scene-grounded output, together with a trajectory initialization strategy and a 3DGS optimization scheme tailored to generated sequences. The approach reports PSNR gains over baselines of more than $3.0$ dB on Replica and more than $2.5$ dB on ScanNet++, along with improved geometry and fewer artifacts. This enables holistic scene modeling from sparse inputs, which matters for real-world novel view synthesis (NVS) and 3D reconstruction.
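To put the reported dB figures in context, the short sketch below uses the standard PSNR definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, and converts a PSNR gain into the corresponding reduction in mean squared error. The function names `psnr` and `mse_reduction_factor` are illustrative choices and do not come from the paper; images are assumed to be flattened into lists of floats in the range [0, `max_value`].

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_reduction_factor(gain_db):
    """Factor by which MSE shrinks when PSNR improves by gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    # A constant error of 0.1 on unit-range pixels gives an MSE of 0.01.
    print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
    # MSE reduction implied by gains of 3.0 dB and 2.5 dB.
    print(mse_reduction_factor(3.0), mse_reduction_factor(2.5))
```

Under this definition, a gain of $3.0$ dB corresponds to roughly halving the mean squared error, since $10^{0.3} \approx 2$.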

Abstract

Despite recent successes in novel view synthesis using 3D Gaussian Splatting (3DGS), modeling scenes from sparse inputs remains a challenge. In this work, we address two critical yet overlooked issues in real-world sparse-input modeling: extrapolation and occlusion. To tackle these issues, we propose a reconstruction-by-generation pipeline that leverages learned priors from video diffusion models to provide plausible interpretations for regions that are outside the field of view or occluded. However, the generated sequences exhibit inconsistencies that limit their benefit to subsequent 3DGS modeling. To address these inconsistencies, we introduce a novel scene-grounding guidance based on rendered sequences from an optimized 3DGS, which tames the diffusion model to generate consistent sequences. This guidance is training-free and does not require any fine-tuning of the diffusion model. To facilitate holistic scene modeling, we also propose a trajectory initialization method that effectively identifies regions that are outside the field of view and regions that are occluded. We further design a scheme tailored for 3DGS optimization with generated sequences. Experiments demonstrate that our method significantly improves upon the baseline and achieves state-of-the-art performance on challenging benchmarks.
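The abstract does not say how the scene-grounding guidance is computed, so the following is only a generic illustration of what training-free guidance can look like, not the authors' formulation. It assumes a hypothetical setup in which, at a denoising step, the current denoised estimate is nudged toward a rendered reference by one gradient step on a masked squared error. The name `guide_toward_render`, the `reliable` mask, and the `weight` parameter are all assumptions introduced for this sketch.

```python
def guide_toward_render(denoised, rendered, reliable, weight=0.5):
    """Take one gradient step on 0.5 * sum(m * (x - r)**2) with respect to x.

    denoised: current denoised pixel estimates (hypothetical diffusion output)
    rendered: pixels rendered from a reconstructed scene (hypothetical reference)
    reliable: 1 where the render is trusted, 0 where it is not (assumed mask)
    weight:   step size; 0 leaves the estimate unchanged, 1 snaps trusted
              pixels onto the render
    """
    if not (len(denoised) == len(rendered) == len(reliable)):
        raise ValueError("all inputs must have the same length")
    guided = []
    for x, r, m in zip(denoised, rendered, reliable):
        # Gradient of the masked squared error for this pixel is m * (x - r).
        guided.append(x - weight * m * (x - r))
    return guided


if __name__ == "__main__":
    # The first pixel is trusted and moves toward the render; the second is not
    # trusted and is left to the generative prior.
    print(guide_toward_render([0.2, 0.8], [1.0, 0.0], [1, 0], weight=0.5))
```

No model weights are updated in this sketch, which is the sense in which such guidance is training-free: only the sample is adjusted during generation. Pixels where the mask is zero are left untouched, so the generative prior remains responsible for regions the reference cannot cover.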
