Novel View Synthesis from A Few Glimpses via Test-Time Natural Video Completion

Yan Xu, Yixing Wang, Stella X. Yu

TL;DR

This work tackles sparse-input novel view synthesis by reframing it as test-time natural video completion, leveraging priors from pretrained video diffusion models. It introduces a zero-shot, generation-guided pipeline that generates uncertainty-aware pseudo-views between sparse inputs to supervise and densify a 3D Gaussian Splatting (3D-GS) representation, iteratively refining geometry and appearance. A Gaussian primitive densification step and an uncertainty-guided diffusion modulation enable robust reconstruction in under-observed regions without scene-specific training. On LLFF, DTU, DL3DV, and MipNeRF-360, the method significantly outperforms strong 3D-GS baselines under extreme sparsity.

Abstract

Given just a few glimpses of a scene, can you imagine the movie playing out as the camera glides through it? That's the lens we take on \emph{sparse-input novel view synthesis}, not only as filling spatial gaps between widely spaced views, but also as \emph{completing a natural video} unfolding through space. We recast the task as \emph{test-time natural video completion}, using powerful priors from \emph{pretrained video diffusion models} to hallucinate plausible in-between views. Our \emph{zero-shot, generation-guided} framework produces pseudo views at novel camera poses, modulated by an \emph{uncertainty-aware mechanism} for spatial coherence. These synthesized frames densify supervision for \emph{3D Gaussian Splatting} (3D-GS) for scene reconstruction, especially in under-observed regions. An iterative feedback loop lets 3D geometry and 2D view synthesis inform each other, improving both the scene reconstruction and the generated views. The result is coherent, high-fidelity renderings from sparse inputs \emph{without any scene-specific training or fine-tuning}. On LLFF, DTU, DL3DV, and MipNeRF-360, our method significantly outperforms strong 3D-GS baselines under extreme sparsity.
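
One simple way to picture the uncertainty-aware mechanism is as a per-pixel blend between a guidance image rendered from the current 3D-GS and the diffusion model's denoised estimate. The sketch below is an illustrative assumption, not the paper's actual modulation rule: the function name `blend_with_uncertainty`, the convex-combination form, and the [0, 1] uncertainty range are all hypothetical.

```python
import numpy as np

def blend_with_uncertainty(guidance, denoised, uncertainty):
    """Hypothetical per-pixel blend (an assumption, not the paper's formula).

    guidance:    (H, W, C) image rendered from the current 3D-GS
    denoised:    (H, W, C) clean-image estimate from the diffusion model
    uncertainty: (H, W) map; 0 = trust the guidance, 1 = trust the diffusion model
    """
    # Clip to [0, 1] and add a channel axis so the map broadcasts over colours.
    u = np.clip(uncertainty, 0.0, 1.0)[..., None]
    # Reliable pixels stay close to the guidance; uncertain pixels follow the prior.
    return (1.0 - u) * guidance + u * denoised
```

With this form, a pixel whose uncertainty is 0 is copied from the guidance image unchanged, while a pixel whose uncertainty is 1 is fully replaced by the diffusion estimate.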

Paper Structure

This paper contains 17 sections, 6 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: We view sparse-input novel view synthesis as temporal-spatial completion of a natural-looking video. Left: Our generation-guided reconstruction pipeline. With the 3D-GS initialized from sparse input views, ① we create guidance images at interpolated poses and estimate their uncertainty from the currently optimized 3D-GS. ② Using both the guidance images and their uncertainties, we modulate the diffusion score function to interpolate between sparse input views. ③ The interpolated views are used to constrain 3D-GS optimization. Right: With our generation-guided reconstruction, regions that are under-observed in the inputs are enhanced by the views generated by the diffusion model.
  • Figure 2: Overall framework. After initializing 3D-GS from sparse input images (①), ② we create guidance images (Sec. \ref{sec:guidance_image}) and assess their uncertainties (Sec. \ref{sec:uncertainty}) based on the current 3D-GS renderings. ③ The guidance images guide the diffusion process through the uncertainty-aware modulation (Sec. \ref{sec:reverse_sample_modulation}). The diffusion process enhances high-uncertainty regions while preserving reliable parts. ④ The generated pseudo-view images are then used to densify the Gaussian primitives (Sec. \ref{sec:gs_densification}) and to constrain the 3D-GS training (Sec. \ref{sec:refinement}). For illustration, we show pseudo-view generation from one image pair, though all pairs are processed sequentially in practice.
  • Figure 3: Cross-view consistency is evaluated through the forward and backward projections shown in (a) to estimate the uncertainty of the generated guidance image. As illustrated in (b), regions exhibiting poor cross-view consistency (regions in the boxes) are identified as high-uncertainty areas (brighter), which are subsequently refined by the video diffusion model.
  • Figure 4: Qualitative comparison with existing methods on the DL3DV dataset, demonstrating the robustness of our method to sparse inputs. Leveraging the priors of the video diffusion model, our method renders photorealistic novel views from only 9 input views, while other methods produce noisier, less realistic results.
  • Figure 5: Qualitative comparison with other methods on DTU and LLFF datasets.
  • ...and 3 more figures
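
The forward and backward projection check described for Figure 3 can be sketched as a depth-based round trip between two views. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a pinhole camera with shared intrinsics `K`, known per-view depth maps, and a relative pose `(R_ab, t_ab)` mapping view A's frame into view B's frame. The names `round_trip_error` and `to_uncertainty`, the nearest-pixel sampling, and the exponential mapping with scale `tau` are all hypothetical.

```python
import numpy as np

def round_trip_error(depth_a, depth_b, K, R_ab, t_ab):
    """Project A's pixels into B (forward), lift them with B's depth and
    project back into A (backward). Returns the per-pixel round-trip
    distance in pixels, or inf where a projection leaves the image."""
    h, w = depth_a.shape
    K_inv = np.linalg.inv(K)
    # Homogeneous pixel grid of view A, shape (3, H*W).
    v, u = np.mgrid[0:h, 0:w]
    pix_a = np.stack([u.ravel(), v.ravel(), np.ones(h * w)]).astype(np.float64)

    # Forward: lift with A's depth, move into B's frame, project with K.
    pts_a = (K_inv @ pix_a) * depth_a.ravel()
    proj_b = K @ (R_ab @ pts_a + t_ab[:, None])
    in_front = proj_b[2] > 1e-8
    z_b = np.where(in_front, proj_b[2], 1.0)
    ub = np.rint(proj_b[0] / z_b).astype(int)  # nearest-pixel sampling
    vb = np.rint(proj_b[1] / z_b).astype(int)
    valid = in_front & (ub >= 0) & (ub < w) & (vb >= 0) & (vb < h)
    ub_c, vb_c = np.clip(ub, 0, w - 1), np.clip(vb, 0, h - 1)

    # Backward: lift the hit pixels with B's depth, return to A's frame.
    pix_b = np.stack([ub_c, vb_c, np.ones(h * w)]).astype(np.float64)
    pts_b = (K_inv @ pix_b) * depth_b[vb_c, ub_c]
    proj_a = K @ (R_ab.T @ (pts_b - t_ab[:, None]))
    valid &= proj_a[2] > 1e-8
    z_a = np.where(valid, proj_a[2], 1.0)
    err = np.linalg.norm(proj_a[:2] / z_a - pix_a[:2], axis=0)
    return np.where(valid, err, np.inf).reshape(h, w)

def to_uncertainty(err, tau=1.0):
    """Map round-trip error to [0, 1]; inf (no valid match) maps to 1."""
    return 1.0 - np.exp(-err / tau)
```

Pixels whose round trip lands back where it started get an uncertainty near 0, while pixels with inconsistent depth or no valid correspondence approach 1, which matches the caption's description of poorly consistent regions appearing brighter.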