GenFusion: Closing the Loop between Reconstruction and Generation via Videos

Sibo Wu, Congrong Xu, Binbin Huang, Andreas Geiger, Anpei Chen

TL;DR

GenFusion addresses the mismatch between 3D reconstruction, which needs densely captured views, and 3D generation, which typically starts from a single view or none. It introduces a reconstruction-driven video diffusion model and a cyclic fusion loop that feeds generated content back into reconstruction as a regularizer. Training pairs come from masked 3D reconstruction, which deliberately produces artifact-prone renderings; the diffusion model is conditioned on these renderings through a depth-aware RGB-D VAE, and its outputs are fed back to expand viewpoint coverage and mitigate viewpoint saturation. The method reports improved sparse-view novel view synthesis, more robust view extrapolation, and scene completion across diverse datasets, suggesting that video priors are a practical route to reducing artifacts in reconstructed 3D scenes and to augmenting scene content.

Abstract

Recently, 3D reconstruction and generation have demonstrated impressive novel view synthesis results, achieving high fidelity and efficiency. However, a notable conditioning gap can be observed between these two fields: scalable 3D scene reconstruction often requires densely captured views, whereas 3D generation typically relies on a single input view or none at all, which significantly limits the applications of both. We found that the source of this phenomenon lies in the misalignment between 3D constraints and generative priors. To address this problem, we propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. Moreover, we propose a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set, enabling progressive expansion and addressing the viewpoint saturation limitations seen in previous reconstruction and generation pipelines. Our evaluation, including view synthesis from sparse-view and masked inputs, validates the effectiveness of our approach. More details at https://genfusion.sibowu.com.

Paper Structure

This paper contains 21 sections, 6 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: From top to bottom: 2DGS baseline, with train view monocular depth added, with sample view RGB added, with sample view depth added, and finally with sparsity-aware densification.
  • Figure 2: GenFusion pipeline. Our approach contains two stages: video diffusion pre-training (left) and zero-shot generalization (right). In pre-training, we first fine-tune DynamiCrafter [dynamicrafter] on RGB-D videos from DL3DV, a large-scale real-world scene video dataset. Captured videos are patchified, and a random patch sequence is selected for 3D scene reconstruction. The reconstruction is rendered as full-frame RGB-D videos, which serve as input to our video diffusion model, supervised by the original video capture and its monocular depth. During generalization, we treat reconstruction and generation as a cyclical process, iteratively adding restoration frames from the generative model to the training set for artifact removal and scene completion.
  • Figure 2: Qualitative comparison of novel view synthesis using masked input on Tanks and Temples (TnT) scenes [Knapitsch2017].
  • Figure 3: Artifact-GT video pair generation using masked reconstruction. (a) Current state-of-the-art Gaussian Splatting methods render accurately near training views but produce artifacts at distant views, such as those along the red trajectory, because of limited angular supervision. (b) We propose a masked reconstruction scheme that replicates such artifact patterns for training video diffusion models: $75\%$ of pixels are masked during 3D reconstruction, and the scene is then re-rendered along the original trajectory, including the masked pixels.
  • Figure 4: Reconstruction-driven video generation. Our video diffusion model generates realistic RGB-D video from artifact-prone RGB-D renderings, which is then used as photometric guidance during cyclic fusion.
  • ...and 2 more figures
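
The masking step described in the captions of the GenFusion pipeline figure and Figure 3 can be illustrated with a minimal sketch. It assumes that the "random patch" is a single rectangular window shared by all frames and that it covers 25% of each frame, so that 75% of pixels are hidden from reconstruction. The function names, the window shape, and the toy frame size are hypothetical choices made for illustration and are not taken from the paper or its code.

```python
import math
import random


def random_patch_mask(height, width, keep_fraction=0.25, rng=None):
    """Return an H x W grid of booleans that is True inside one random patch.

    Assumption: the visible region is a single rectangle whose sides are
    scaled by sqrt(keep_fraction), so it covers about keep_fraction of pixels.
    """
    if rng is None:
        rng = random.Random()
    scale = math.sqrt(keep_fraction)
    patch_h = max(1, round(height * scale))
    patch_w = max(1, round(width * scale))
    top = rng.randint(0, height - patch_h)   # randint is inclusive on both ends
    left = rng.randint(0, width - patch_w)
    return [
        [top <= y < top + patch_h and left <= x < left + patch_w
         for x in range(width)]
        for y in range(height)
    ]


def masked_fraction(mask):
    """Fraction of pixels hidden from reconstruction (mask value False)."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / total


def apply_mask(frame, mask, fill=0.0):
    """Replace the hidden pixels of one H x W frame with `fill`."""
    return [
        [pixel if keep else fill for pixel, keep in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


rng = random.Random(0)
mask = random_patch_mask(64, 96, keep_fraction=0.25, rng=rng)

# Toy single-channel "video": 4 frames of 64 x 96 random intensities.
video = [[[rng.random() for _ in range(96)] for _ in range(64)] for _ in range(4)]
visible_video = [apply_mask(frame, mask) for frame in video]

print(masked_fraction(mask))  # 0.75 for this 64 x 96 example
```

In this sketch only `visible_video` would be handed to the reconstruction step, while the unmasked `video` would remain available as the ground-truth half of each artifact-GT pair. The reconstruction and re-rendering steps themselves are outside the scope of the sketch.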