Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs

Yingji Zhong, Zhihao Li, Dave Zhenyu Chen, Lanqing Hong, Dan Xu

TL;DR

To address extrapolation and occlusion in sparse-input 3D Gaussian Splatting (3DGS), this paper proposes a reconstruction-by-generation framework that leverages video diffusion priors. It introduces a training-free scene-grounding guidance, driven by renders from an optimized 3DGS, that tames the diffusion model into generating consistent, scene-grounded sequences. It also contributes a trajectory initialization strategy and a 3DGS optimization scheme tailored to generated sequences. The approach improves PSNR over baselines by more than $3.0$ dB on Replica and more than $2.5$ dB on ScanNet++, and it improves geometry and suppresses artifacts. This enables holistic scene modeling from sparse inputs for real-world novel view synthesis (NVS) and 3D reconstruction.

Abstract

Despite recent successes in novel view synthesis using 3D Gaussian Splatting (3DGS), modeling scenes with sparse inputs remains a challenge. In this work, we address two critical yet overlooked issues in real-world sparse-input modeling: extrapolation and occlusion. To tackle these issues, we propose to use a reconstruction by generation pipeline that leverages learned priors from video diffusion models to provide plausible interpretations for regions outside the field of view or occluded. However, the generated sequences exhibit inconsistencies that do not fully benefit subsequent 3DGS modeling. To address the challenge of inconsistencies, we introduce a novel scene-grounding guidance based on rendered sequences from an optimized 3DGS, which tames the diffusion model to generate consistent sequences. This guidance is training-free and does not require any fine-tuning of the diffusion model. To facilitate holistic scene modeling, we also propose a trajectory initialization method. It effectively identifies regions that are outside the field of view and occluded. We further design a scheme tailored for 3DGS optimization with generated sequences. Experiments demonstrate that our method significantly improves upon the baseline and achieves state-of-the-art performance on challenging benchmarks.

Paper Structure

This paper contains 16 sections, 9 equations, 13 figures, 7 tables, 2 algorithms.

Figures (13)

  • Figure 1: We tackle the critical issues of (a) extrapolation and (b) occlusion in sparse-input 3DGS by leveraging a video diffusion model. Vanilla generation often suffers from inconsistencies within the generated sequences (as highlighted by the yellow arrows), leading to black shadows in the rendered images. In contrast, our scene-grounding generation produces consistent sequences, effectively addressing these issues and enhancing overall quality (c), as indicated by the blue boxes. The numbers refer to PSNR values. Zoom in for better visualization.
  • Figure 2: Framework overview of our proposed method. It consists of three parts: scene-grounding guidance, trajectory initialization, and optimization scheme with generated sequences. Initially, a baseline 3DGS is trained using sparse inputs and initialized with the point cloud from DUSt3R [wang2024dust3r]. Yellow regions denote uncovered areas, e.g., those outside the field of view or occluded. The trajectory initialization determines the paths for sequence generation based on renderings from the baseline 3DGS, facilitating holistic scene modeling. The video diffusion model receives an input image along with the trajectory for sequence generation, incorporating scene-grounding guidance during the denoising process to ensure consistent output. The guidance is based on the rendered sequences. Finally, the generated sequences are utilized to optimize the final 3DGS through a tailored optimization scheme.
  • Figure 3: Illustration of the proposed trajectory initialization strategy. The yellow parts represent unobserved regions. For each input view, we sample a set of candidate poses around it, and render at these poses using an optimized 3DGS. We select candidate poses whose renderings exhibit significant holes (highlighted by red boxes), and interpolate trajectories between these candidate poses and the input view's pose.
  • Figure 4: Sequences from the vanilla generation suffer from inconsistencies. A 3DGS model optimized with these sequences renders images with black shadows, highlighted by red boxes, while our method solves this issue with the scene-grounding guidance.
  • Figure 5: Qualitative comparisons on the Replica and ScanNet++ datasets. All 3DGS-based methods are optimized using the initialized point cloud from DUSt3R [wang2024dust3r]. Our method effectively addresses the issues of extrapolation and occlusion while preserving finer details and reducing artifacts. For better visualization, please zoom in on the results.
  • ...and 8 more figures
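
The trajectory initialization described in the Figure 3 caption (sample candidate poses around an input view, render them with the optimized 3DGS, keep the candidates whose renderings show significant holes, then interpolate toward them) might be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation. It assumes that each rendering comes with an accumulated-opacity (alpha) map and that a "hole" is a pixel with low opacity. The function names, the `thresh` and `min_hole_ratio` values, and the use of linear interpolation on camera centers only are all hypothetical choices made for the example.

```python
import numpy as np

def hole_ratio(alpha, thresh=0.5):
    """Fraction of pixels whose accumulated opacity falls below `thresh`.

    `alpha` is an (H, W) opacity map from a renderer; treating low-opacity
    pixels as holes is an assumption of this sketch.
    """
    return float((alpha < thresh).mean())

def select_candidates(alphas, min_hole_ratio=0.1):
    """Indices of candidate poses whose renderings contain significant holes."""
    return [i for i, a in enumerate(alphas) if hole_ratio(a) >= min_hole_ratio]

def interpolate_positions(start, end, num_frames):
    """Linearly interpolate camera centers from `start` to `end`, inclusive.

    Rotations are omitted for brevity; a fuller version would also
    interpolate orientation (for example with spherical interpolation).
    """
    t = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - t) * start + t * end

# Toy data standing in for renderings at two candidate poses (hypothetical).
full = np.ones((4, 4))            # fully covered rendering, no holes
holey = np.ones((4, 4))
holey[:, :2] = 0.0                # half of the pixels are uncovered
alphas = [full, holey]

input_pos = np.zeros(3)           # camera center of the input view
candidate_pos = np.array([[0.1, 0.0, 0.0],
                          [0.0, 0.0, 1.0]])

selected = select_candidates(alphas, min_hole_ratio=0.1)
trajectories = [interpolate_positions(input_pos, candidate_pos[i], 5)
                for i in selected]
```

In this toy setup only the second candidate is kept, and its trajectory runs from the input view's camera center to the candidate's. In practice the opacity maps would come from rendering the baseline 3DGS at each sampled pose, and the selection threshold would need tuning per scene.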