Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model
Shengjun Zhang, Jinzhao Li, Xin Fei, Hao Liu, Yueqi Duan
TL;DR
Scene Splatter is a momentum-based framework that generates 3D scenes from a single image by combining 3D Gaussian Splatting with video diffusion priors. It introduces a latent-level momentum, computed per latent feature with a coefficient λ, and a pixel-level momentum that blends the momentum-guided video with a video generated without momentum, so that generated views are both detailed and consistent. By iteratively refining the Gaussians along a predefined camera trajectory, the method achieves higher fidelity and scene coherence than regression-based and generation-based baselines, as measured by PSNR, SSIM, and LPIPS on RealEstate10K. The approach offers a practical route to extended, consistent 3D view synthesis from monocular input. The authors acknowledge its computational overhead and its current restriction to static scenes, and point to efficiency and 4D scene generation as future work.
Abstract
In this paper, we propose Scene Splatter, a momentum-based paradigm for video diffusion to generate generic scenes from a single image. Existing methods, which employ video generation models to synthesize novel views, suffer from limited video length and scene inconsistency, leading to artifacts and distortions during subsequent reconstruction. To address this issue, we construct noisy samples from the original features as momentum to enhance video details and maintain scene consistency. However, for latent features whose perception field spans both known and unknown regions, such latent-level momentum restricts the generative ability of video diffusion in the unknown regions. Therefore, we further introduce the aforementioned consistent video as a pixel-level momentum to a video generated directly without momentum, for better recovery of unseen regions. Our cascaded momentum enables video diffusion models to generate novel views that are both high-fidelity and consistent. We then fine-tune the global Gaussian representations with the enhanced frames and render new frames for the momentum update in the next step. In this manner, we can iteratively recover a 3D scene, avoiding the limitation of video length. Extensive experiments demonstrate the generalization capability and superior performance of our method in high-fidelity and consistent scene generation.
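
The two momentum terms described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas, so the sketch assumes a standard DDPM-style forward noising step for "constructing noisy samples from original features", a convex combination with a per-feature coefficient for the latent-level momentum, and a per-pixel weight for the pixel-level blend. All function names, the `alpha_bar_t` parameter, and the `known_weight` mask are hypothetical and introduced only for illustration.

```python
import math
import random


def add_noise(x0, alpha_bar_t, rng):
    """Noise clean features x0 to a given level (assumed DDPM form):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    s = math.sqrt(alpha_bar_t)
    n = math.sqrt(1.0 - alpha_bar_t)
    return [s * v + n * rng.gauss(0.0, 1.0) for v in x0]


def latent_momentum(z_t, z_ref_t, lam):
    """Assumed latent-level momentum: per-feature convex blend.
    lam[i] = 1 keeps the noised reference feature, lam[i] = 0 keeps
    the freely denoised latent."""
    return [l * r + (1.0 - l) * z for z, r, l in zip(z_t, z_ref_t, lam)]


def pixel_momentum(frame_momentum, frame_free, known_weight):
    """Assumed pixel-level momentum: per-pixel blend of the consistent
    (momentum-guided) frame and the frame generated without momentum.
    known_weight[i] near 1 trusts the consistent frame, near 0 trusts
    the freely generated frame (e.g. in unseen regions)."""
    return [
        w * a + (1.0 - w) * b
        for a, b, w in zip(frame_momentum, frame_free, known_weight)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    rendered_latent = [0.8, -0.3, 0.1]   # hypothetical features from a rendered view
    current_latent = [0.0, 0.5, -0.2]    # hypothetical latent being denoised
    lam = [1.0, 0.5, 0.0]                # hypothetical per-feature coefficients

    # Noise the reference features, then pull the current latent toward them.
    noisy_ref = add_noise(rendered_latent, alpha_bar_t=0.9, rng=rng)
    guided = latent_momentum(current_latent, noisy_ref, lam)

    # Blend the guided result with an unguided one at pixel level.
    blended = pixel_momentum(guided, current_latent, [1.0, 1.0, 0.0])
    print(guided, blended)
```

In the full pipeline described in the abstract, the blended frames would then be used to fine-tune the global Gaussian representation, and newly rendered frames would supply the reference features for the next step. That loop is omitted here because it depends on components the abstract does not specify.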
