Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction
Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi
TL;DR
Geo4D tackles monocular 4D reconstruction of dynamic scenes by repurposing a pre-trained video diffusion model. The model is trained entirely on synthetic data and predicts three geometric modalities: viewpoint-invariant point maps, disparity maps, and ray maps. These predictions are then aligned and fused by a multi-modal alignment process applied over temporal sliding windows. The method substantially improves video depth estimation and achieves competitive camera pose results, and it generalizes to real data without per-video optimization. The work points toward embedding explicit 4D geometry in video foundation models and toward diffusion-based dynamic scene understanding with synthetic-to-real transfer.
Abstract
We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic priors captured by large-scale pre-trained video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, disparity, and ray maps. We propose a new multi-modal alignment algorithm to align and fuse these modalities, as well as a sliding window approach at inference time, thus enabling robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods.
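To make the sliding-window idea concrete, the following is a minimal sketch of how a long video could be split into overlapping windows of frame indices. It is an illustration only: the function name `sliding_windows`, the parameters `window` and `stride`, and the choice to add a final window flush with the end of the video are assumptions, not details given in the abstract.

```python
from typing import List


def sliding_windows(num_frames: int, window: int, stride: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into overlapping windows.

    Hypothetical helper: window length and stride are illustrative
    parameters, not values taken from the paper.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        raise ValueError("stride larger than window would skip frames")
    if num_frames <= window:
        # A short video fits in a single window.
        return [range(num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    # Add one last window flush with the end so trailing frames are covered.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [range(s, s + window) for s in starts]


if __name__ == "__main__":
    for w in sliding_windows(num_frames=11, window=4, stride=2):
        print(list(w))
```

With a stride smaller than the window length, neighbouring windows share frames. Those shared frames are what a subsequent alignment step could use to bring per-window predictions into a common scale and coordinate frame.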
