GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting

Junlin Hao, Peiheng Wang, Haoyang Wang, Xinggong Zhang, Zongming Guo

TL;DR

GaussVideoDreamer bridges image, video, and 3D generation and integrates their strengths through two key innovations: a progressive video inpainting strategy that harnesses temporal coherence for improved multiview consistency and faster convergence, and a 3D Gaussian Splatting consistency mask that guides the video diffusion with 3D-consistent multiview evidence.

Abstract

Single-image 3D scene reconstruction presents significant challenges due to its inherently ill-posed nature and limited input constraints. Recent advances have explored two promising directions: multiview generative models that train on 3D consistent datasets but struggle with out-of-distribution generalization, and 3D scene inpainting and completion frameworks that suffer from cross-view inconsistency and suboptimal error handling, as they depend exclusively on depth data or 3D smoothness, which ultimately degrades output quality and computational performance. Building upon these approaches, we present GaussVideoDreamer, which advances generative multimedia approaches by bridging the gap between image, video, and 3D generation, integrating their strengths through two key innovations: (1) A progressive video inpainting strategy that harnesses temporal coherence for improved multiview consistency and faster convergence. (2) A 3D Gaussian Splatting consistency mask to guide the video diffusion with 3D consistent multiview evidence. Our pipeline combines three core components: a geometry-aware initialization protocol, Inconsistency-Aware Gaussian Splatting, and a progressive video inpainting strategy. Experimental results demonstrate that our approach achieves 32% higher LLaVA-IQA scores and at least 2x speedup compared to existing methods while maintaining robust performance across diverse scenes.

Paper Structure

This paper contains 26 sections, 11 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Multiview generative models exhibit limited generalization capability, manifesting as geometric inconsistencies and semantic distortions in synthesized views.
  • Figure 2: Overview of our pipeline. Our method first initializes a coarse video and inconsistency-aware GS (IA-GS) from a single input image (Sec. \ref{sec:init}). At periodic optimization intervals, we render all viewpoint images and their corresponding inconsistency prediction masks from the IA-GS representation (Sec. \ref{sec:iags}). These masks and rendered images then guide a video diffusion model to perform progressive inpainting, editing regions based on their inconsistency levels (Sec. \ref{sec:refine}). The refined video sequence subsequently optimizes our IA-GS module and gradually generates better novel view images.
  • Figure 3: Qualitative Results. Left: Input reference image. Middle: Novel view renderings and RGB-depth split from our 3DGS. Right: Depth map visualization. The results demonstrate that our method can generate coherent color and consistent geometry across both indoor and outdoor scenes.
  • Figure 4: Overview of initialization pipeline. We first lift the input image to a point cloud $\mathcal{P}$, then render $\mathcal{P}$ at auxiliary viewpoints, and inpaint the occluded regions. We warp the newly inpainted image to existing views and validate geometry through depth verification. Only geometrically consistent regions (via add mask) are added to $\mathcal{P}$, yielding a coarse 3D scene $\mathcal{P}_{aux}$ that initializes both our video generation and IA-GS.
  • Figure 5: Overview of our video refinement method. We compute progressive change maps from inconsistency-aware masks and refinement masks to guide inpainting. We progressively integrate reliable multiview evidence for higher-quality output.
  • ...and 3 more figures
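
The initialization described in the Figure 4 caption (lift the image to a point cloud, then admit only geometrically consistent regions through an add mask) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the pinhole intrinsics `fx, fy, cx, cy`, the function names `unproject_depth` and `add_mask`, the relative tolerance `rel_tol`, and the acceptance rule itself (a warped pixel passes if the existing view has no geometry there, or if its depth agrees with the existing depth within the tolerance). Plain Python lists are used to keep the sketch dependency-free.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
DepthMap = List[List[float]]  # row-major; 0.0 marks a pixel without depth


def unproject_depth(depth: DepthMap, fx: float, fy: float,
                    cx: float, cy: float) -> List[Point]:
    """Lift every valid pixel (u, v) with depth d to a camera-space 3D point
    using a standard pinhole model (assumed here, not stated in the source)."""
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0.0:  # skip pixels that carry no depth
                continue
            points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points


def add_mask(warped: DepthMap, existing: DepthMap,
             rel_tol: float = 0.05) -> List[List[bool]]:
    """Hypothetical depth verification: accept a warped pixel if the existing
    view is empty there, or if both depths agree within rel_tol."""
    mask: List[List[bool]] = []
    for w_row, e_row in zip(warped, existing):
        mask_row = []
        for w, e in zip(w_row, e_row):
            if w <= 0.0:        # nothing was warped to this pixel
                mask_row.append(False)
            elif e <= 0.0:      # hole in the existing view: accept new geometry
                mask_row.append(True)
            else:               # both present: require depth agreement
                mask_row.append(abs(w - e) <= rel_tol * e)
        mask.append(mask_row)
    return mask
```

In such a scheme, only pixels where the mask is true would be unprojected and merged into the point cloud; the actual verification criterion and threshold used by the authors may differ.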