Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi

TL;DR

Geo4D tackles monocular 4D reconstruction of dynamic scenes by repurposing a pre-trained video diffusion model. Trained entirely on synthetic data, it predicts three geometric modalities (viewpoint-invariant point maps, disparity maps, and ray maps) and fuses them through a multi-modal alignment process applied over a temporal sliding window. The method substantially improves video depth estimation and achieves competitive camera pose results, generalizing zero-shot to real data. The work suggests a path toward embedding explicit 4D geometry into video foundation models and toward diffusion-based dynamic scene understanding with synthetic-to-real transfer.

Abstract

We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic priors captured by large-scale pre-trained video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, disparity, and ray maps. We propose a new multi-modal alignment algorithm to align and fuse these modalities, as well as a sliding window approach at inference time, thus enabling robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods.
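
The sliding window idea mentioned in the abstract can be illustrated with a small sketch. It only shows how a long video might be split into overlapping fixed-length frame ranges; the function name `sliding_windows`, the window length, and the stride are assumptions chosen for illustration, not values or code taken from the paper, and the alignment of the per-window predictions is not shown.

```python
from typing import List, Tuple


def sliding_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges that together cover the video.

    Hypothetical helper: `window` and `stride` are illustrative parameters.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        raise ValueError("stride must not exceed window, or frames would be skipped")
    if num_frames <= 0:
        return []
    if num_frames <= window:
        # Short video: a single (possibly shorter) window is enough.
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    # Add one final full-length window if the regular grid misses the tail.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


if __name__ == "__main__":
    for start, end in sliding_windows(num_frames=45, window=16, stride=8):
        print(f"frames [{start}, {end})")
```

Because consecutive windows share frames, predictions for the shared frames are available from more than one window, which is what makes a later alignment across windows possible.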

Paper Structure

This paper contains 43 sections, 10 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: Geo4D repurposes a video diffusion model [xing2024dynamicrafter] for monocular 4D reconstruction. It uses only synthetic data for training, yet generalizes well to out-of-domain real videos. It predicts several geometric modalities, including point maps, disparity maps, and ray maps, fusing and aligning them to obtain state-of-the-art dynamic reconstruction even for scenes with extreme object and camera motion.
  • Figure 2: Overview of Geo4D. During training, the video condition is injected locally, by concatenating the latent features of the video with the diffused geometric features $\mathbf{z}_t^{\mathbf{X}}, \mathbf{z}_t^{\mathbf{D}}, \mathbf{z}_t^{\mathbf{r}}$, and globally, via cross-attention in the denoising U-Net after CLIP encoding and a query transformer. The U-Net is fine-tuned via Eq. \ref{eq:vdm}. During inference, the iteratively denoised latent features $\hat{\mathbf{z}}_0^{\mathbf{X}}, \hat{\mathbf{z}}_0^{\mathbf{D}}, \hat{\mathbf{z}}_0^{\mathbf{r}}$ are decoded by the fine-tuned VAE decoder, followed by multi-modal alignment optimization for coherent 4D reconstruction.
  • Figure 3: Qualitative results comparing Geo4D with MonST3R [zhang24monst3r]. Thanks to group-wise inference and the geometric prior of the pre-trained video diffusion model, our model produces consistent 4D geometry under fast motion (first row) and in the presence of deceptive reflections in the water (second row).
  • Figure 4: Qualitative video depth results comparing Geo4D with MonST3R [zhang24monst3r] and DepthCrafter [hu25depthcrafter]. Owing to the proposed multi-modal training and alignment, as well as the prior knowledge from diffusion, our method infers a more detailed structure (first row) and a more accurate spatial arrangement from video (second row).
  • Figure 5: Additional qualitative results. Our method generalizes well to various scenes with different 4D objects and performs robustly against different camera and object motions.
  • ...and 1 more figure
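
The captions above describe three predicted modalities (point maps, disparity maps, and ray maps) that are later aligned. The sketch below illustrates why these modalities carry overlapping information: under simplifying assumptions, a point map can be recomputed from a disparity map and a ray map. This is not the paper's alignment algorithm. It assumes that disparity is the exact inverse of the distance along each ray (no unknown scale or shift), that the ray map is stored as per-pixel origins and directions, and the function name `points_from_disparity_and_rays` is hypothetical.

```python
import numpy as np


def points_from_disparity_and_rays(disparity, ray_origins, ray_dirs, eps=1e-6):
    """Recompute 3D points as origin + depth * unit_direction.

    Assumptions (not from the paper): depth = 1 / disparity with no scale or
    shift ambiguity, and depth is measured along the ray.
    disparity:   (H, W) array
    ray_origins: (H, W, 3) array, or any shape broadcastable to it
    ray_dirs:    (H, W, 3) array of non-zero direction vectors
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    ray_origins = np.asarray(ray_origins, dtype=np.float64)
    ray_dirs = np.asarray(ray_dirs, dtype=np.float64)
    # Clip to avoid division by zero for pixels with vanishing disparity.
    depth = 1.0 / np.clip(disparity, eps, None)
    # Normalize directions so that depth is a distance along the ray.
    unit_dirs = ray_dirs / np.linalg.norm(ray_dirs, axis=-1, keepdims=True)
    return ray_origins + depth[..., None] * unit_dirs


if __name__ == "__main__":
    disp = np.full((2, 2), 0.5)              # depth of 2 everywhere
    origins = np.zeros((2, 2, 3))            # camera at the world origin
    dirs = np.tile([0.0, 0.0, 2.0], (2, 2, 1))  # looking along +z, unnormalized
    print(points_from_disparity_and_rays(disp, origins, dirs))
```

In practice the predicted modalities would not agree exactly, which is why a fusion step that reconciles them is needed rather than a direct recomputation like this one.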