Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model

Shengjun Zhang, Jinzhao Li, Xin Fei, Hao Liu, Yueqi Duan

TL;DR

Scene Splatter presents a momentum-based framework for generating 3D scenes from a single image by marrying 3D Gaussian Splatting with diffusion priors. It introduces latent-level momentum, computed per latent feature with a coefficient λ, and a pixel-level momentum that blends momentum-augmented and non-momentum video renders, enabling both detail and consistency. Through iterative Gaussian refinement along a predefined camera trajectory, the method achieves higher fidelity and scene coherence than regression- and generation-based baselines, as demonstrated on RealEstate10K with favorable PSNR, SSIM, and LPIPS scores. The approach offers a practical pathway to extended, consistent 3D view synthesis from monocular input, while acknowledging computational overhead and current focus on static scenes, with future work aimed at efficiency and 4D scene generation.

Abstract

In this paper, we propose Scene Splatter, a momentum-based paradigm for video diffusion to generate generic scenes from a single image. Existing methods, which employ video generation models to synthesize novel views, suffer from limited video length and scene inconsistency, leading to artifacts and distortions during further reconstruction. To address this issue, we construct noisy samples from original features as momentum to enhance video details and maintain scene consistency. However, for latent features whose perception field spans both known and unknown regions, such latent-level momentum restricts the generative ability of video diffusion in unknown regions. Therefore, we further introduce the aforementioned consistent video as a pixel-level momentum to a directly generated video without momentum for better recovery of unseen regions. Our cascaded momentum enables video diffusion models to generate novel views that are both high-fidelity and consistent. We then finetune the global Gaussian representations with the enhanced frames and render new frames for the momentum update in the next step. In this manner, we can iteratively recover a 3D scene, avoiding the limitation of video length. Extensive experiments demonstrate the generalization capability and superior performance of our method in high-fidelity and consistent scene generation.
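The two momentum levels described in the abstract can be illustrated with a small numeric sketch. This is not the paper's formulation: it assumes a DDPM-style forward noising step for the "noisy samples from original features", a simple convex combination with coefficient `lam` for the latent-level momentum, and a per-pixel convex combination for the pixel-level momentum. All function names, the `alpha_bar` parameter, and the toy values are hypothetical.

```python
import math
import random


def noisy_sample(z0, alpha_bar, rng):
    """Noise clean latents z0 to the level given by alpha_bar (assumed DDPM-style)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + b * rng.gauss(0.0, 1.0) for z in z0]


def latent_momentum(z_t, z_ref_t, lam):
    """Assumed latent-level momentum: pull current latents toward the noised reference."""
    return [lam * r + (1.0 - lam) * z for z, r in zip(z_t, z_ref_t)]


def pixel_momentum(consistent, generated, scale):
    """Assumed pixel-level momentum: scale=1 keeps the consistent pixel, scale=0 the free one."""
    return [s * c + (1.0 - s) * g for c, g, s in zip(consistent, generated, scale)]


rng = random.Random(0)
z_orig = [0.5, -1.0, 2.0]  # toy latents encoded from the rendered video
z_t = [0.0, 0.0, 0.0]      # toy latents at the current denoising step
z_ref_t = noisy_sample(z_orig, alpha_bar=0.9, rng=rng)
z_t = latent_momentum(z_t, z_ref_t, lam=0.5)

consistent = [0.2, 0.4, 0.6]  # toy pixels from the momentum-guided video
generated = [0.8, 0.8, 0.8]   # toy pixels from the video generated without momentum
scale = [1.0, 0.5, 0.0]       # toy coefficients: 1 = known region, 0 = unseen region
blended = pixel_momentum(consistent, generated, scale)
print(blended)
```

With `lam = 0` the latent update leaves the vanilla diffusion latents untouched, which matches the abstract's description of a video generated directly without momentum.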

Paper Structure

This paper contains 14 sections, 20 equations, 7 figures, 2 tables, 2 algorithms.

Figures (7)

  • Figure 1: Visualization results of Flash3D [Flash3D2024NIPS], CogVideoX [CogVideoX2024arXiv], ViewCrafter [ViewCrafter2024arXiv], and ours. Flash3D suffers from distortions and occlusions, while CogVideoX and ViewCrafter change the color style or existing components relative to the input image. Our method generates high-fidelity and consistent 3D scenes with our cascaded momentum.
  • Figure 2: The pipeline of Scene Splatter. We initialize the Gaussian representations from the input image $I_{0}$ with a Gaussian Predictor [Flash3D2024NIPS]. In each iteration, we first render the video $\mathcal{I}$ from the 3D Gaussians $\mathcal{G}$. Then, we generate the enhanced video $\Phi_{\lambda}(\mathcal{I})$ with latent-level momentum and $\Phi_{0}(\mathcal{I})$ directly from the vanilla diffusion model, where $\Phi_{\lambda}$ and $\Phi_{0}$ share the same weights of the denoising network. We then render scale maps as pixel-level momentum coefficients to enhance the generated frames. We use the final results to supervise the optimization of the Gaussian representations. We conduct this process along the camera trajectory to iteratively recover 3D scenes.
  • Figure 3: Visualization of the vanilla video diffusion model [ViewCrafter2024arXiv] and our two-level momentum. We observe that latent-level momentum can enhance details (red box) while maintaining consistency (blue box). Yet, such momentum limits the generation ability for unseen regions (green box). Motivated by this observation, we further propose pixel-level momentum to benefit from both (c) and (d).
  • Figure 4: Qualitative comparison of 3D scene generation from a single view. Given various kinds of input views, our method produces high-fidelity and consistent 3D scenes.
  • Figure 5: Visualization of rendering results in each iteration. The inconsistency in CogVideoX [CogVideoX2024arXiv] and ViewCrafter [ViewCrafter2024arXiv] gradually increases. Our method maintains high consistency during the iterative reconstruction process.
  • ...and 2 more figures
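
The iteration described in the Figure 2 caption can be read as the control-flow sketch below. It only mirrors the order of steps in the caption (render, two diffusion passes with shared weights, scale-map blend, Gaussian finetuning, advance along the trajectory). Every callable (`render`, `diffuse`, `render_scale`, `finetune`), the toy stand-ins, and the direction of the blend are assumptions made for illustration, not the authors' implementation.

```python
def iterative_refine(gaussians, trajectory_chunks, render, diffuse, render_scale, finetune, lam):
    """Refine a scene chunk by chunk along a camera trajectory (hypothetical interface)."""
    for cams in trajectory_chunks:
        video = render(gaussians, cams)        # rendered video I from the current Gaussians G
        consistent = diffuse(video, lam)       # Phi_lambda(I): with latent-level momentum
        free = diffuse(video, 0.0)             # Phi_0(I): same network, no momentum
        scale = render_scale(gaussians, cams)  # scale maps used as pixel-level coefficients
        # Assumed blend: high scale keeps the consistent frame, low scale the free one.
        target = [s * c + (1.0 - s) * f for c, f, s in zip(consistent, free, scale)]
        gaussians = finetune(gaussians, cams, target)  # supervise Gaussians with the result
    return gaussians


# Toy stand-ins so the loop runs end to end; they carry no real rendering or diffusion.
def toy_render(g, cams):
    return [g["value"] for _ in cams]


def toy_diffuse(video, lam):
    # Pretend that unconstrained generation drifts toward 1.0 and momentum resists the drift.
    return [lam * v + (1.0 - lam) * 1.0 for v in video]


def toy_scale(g, cams):
    return [0.5 for _ in cams]


def toy_finetune(g, cams, target):
    return {"value": sum(target) / len(target), "steps": g["steps"] + 1}


result = iterative_refine(
    {"value": 0.0, "steps": 0},
    [[0, 1], [2, 3]],  # two chunks of camera indices along the trajectory
    toy_render, toy_diffuse, toy_scale, toy_finetune,
    lam=0.5,
)
print(result)
```

Because each pass re-renders from the updated Gaussians, the number of chunks is not bounded by the length of a single generated video, which is the point the caption makes about iterating along the camera trajectory.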