Can Video Diffusion Model Reconstruct 4D Geometry?

Jinjie Mai, Wenxuan Zhu, Haozhe Liu, Bing Li, Cheng Zheng, Jürgen Schmidhuber, Bernard Ghanem

TL;DR

Sora3R reconstructs dynamic 4D geometry from monocular video by leveraging the spatiotemporal priors learned by large-scale video diffusion models. It uses a two-stage pipeline: it first adapts a pointmap VAE from a pretrained video VAE, then fine-tunes a transformer-based diffusion backbone in the joint video and pointmap latent space to predict 4D pointmaps for all frames in a feedforward fashion. The approach needs no external modules or iterative global alignment, and it recovers camera poses and detailed dynamic geometry across diverse scenes while remaining on par with state-of-the-art methods on dynamic data. The work shows that generative video models can inform and accelerate dynamic 3D reconstruction, and it points to further gains from stronger backbones and longer video sequences.

Abstract

Reconstructing dynamic 3D scenes (i.e., 4D geometry) from monocular video is an important yet challenging problem. Conventional multiview geometry-based approaches often struggle with dynamic motion, whereas recent learning-based methods require either a specialized 4D representation or sophisticated optimization. In this paper, we present Sora3R, a novel framework that taps into the rich spatiotemporal priors of large-scale video diffusion models to directly infer 4D pointmaps from casual videos. Sora3R follows a two-stage pipeline: (1) we adapt a pointmap VAE from a pretrained video VAE, ensuring compatibility between the geometry and video latent spaces; (2) we finetune a diffusion backbone in the combined video and pointmap latent space to generate coherent 4D pointmaps for every frame. Sora3R operates in a fully feedforward manner, requiring no external modules (e.g., depth, optical flow, or segmentation) or iterative global alignment. Extensive experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction across diverse scenarios.
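
As a rough illustration of stage (2), the sketch below computes one denoising loss on toy latents. It is not the authors' implementation: the random arrays stand in for the outputs of the video and pointmap VAE encoders, `toy_denoiser` is a hypothetical stand-in for the diffusion backbone, and the latent shape, the channel-wise concatenation, and the linear noising rule are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent shape: (frames, channels, height, width).
T, C, H, W = 5, 4, 6, 8

def toy_denoiser(x, t):
    """Stand-in for the diffusion backbone: maps the concatenated latent
    (T, 2C, H, W) to a noise prediction of shape (T, C, H, W)."""
    return x[:, C:] * (1.0 - t)  # placeholder arithmetic, not a real model

def denoising_loss(h_rgb, h_xyz, t, eps, denoiser):
    """Loss for one training step under an assumed linear noising rule."""
    noisy_xyz = (1.0 - t) * h_xyz + t * eps          # assumed noise schedule
    x = np.concatenate([h_rgb, noisy_xyz], axis=1)   # assumed channel-wise concat
    pred = denoiser(x, t)
    return float(np.mean((pred - eps) ** 2))         # noise-prediction MSE

h_rgb = rng.standard_normal((T, C, H, W))  # stand-in for the video latent
h_xyz = rng.standard_normal((T, C, H, W))  # stand-in for the pointmap latent
eps = rng.standard_normal((T, C, H, W))    # noise added to the pointmap latent
loss = denoising_loss(h_rgb, h_xyz, 0.5, eps, toy_denoiser)
print(loss)
```

Only the pointmap latent is noised here; the video latent is passed through clean so that it can act as the conditioning signal.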

Paper Structure

This paper contains 21 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Training Pipeline. During training, with the pretrained video VAE encoder ($\mathcal{E}_{\text{RGB}}$) and the pointmap VAE encoder ($\mathcal{E}_{\text{XYZ}}$), the video latent and the noisy pointmap latent are concatenated to train the latent diffusion transformer with a denoising loss.
  • Figure 2: Inference Pipeline. During testing, we sample random noise $\epsilon$ and concatenate it with the video latent $\textbf{H}_{RGB}$. The denoised temporal pointmap latent $\textbf{H}_{XYZ}$ from the DiT is then decoded to 4D pointmaps $\mathbf{\hat{P}}$ through $\mathcal{D}_{\text{XYZ}}$.
  • Figure 3: Visualization Comparisons. From top to bottom: Sora3R (ours), MonST3R, DUSt3R, and ground truth. For each method, the top row is the reconstructed depth map and the second row is the camera trajectory visualized together with the ground-truth trajectory. For ground truth, the top row is depth and the second row is the corresponding video frame. Since TUM-dynamics and ScanNet are captured by depth cameras with missing or invalid pixels, those pixels are marked in dark red.
  • Figure 4: Runtime Comparison. Average runtime (in seconds) to process a video sequence with $17\times384\times512$ spatiotemporal resolution, evaluated over 50 runs on a single A100 GPU.
  • Figure 5: Limitation: Example Failure Case. Left: video frame; Middle: recovered depth; Right: recovered camera trajectory. Sometimes Sora3R predicts imbalanced point clouds ranging from very near to very far, resulting in completely failed camera pose recovery.
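
The inference flow in the Figure 2 caption (sample noise, concatenate with the video latent, denoise, decode) can be sketched as follows. This is a toy under stated assumptions, not the paper's sampler: `toy_denoiser` and `toy_decoder` are hypothetical stand-ins for the DiT and for $\mathcal{D}_{\text{XYZ}}$, the denoiser is assumed to predict a velocity integrated with plain Euler steps, the concatenation axis is assumed to be the channel axis, and the decoder does no upsampling.

```python
import numpy as np

def sample_pointmap_latent(h_rgb, denoiser, steps=10, seed=0):
    """Denoise a pointmap latent conditioned on a video latent.

    Assumes the denoiser predicts a velocity and uses plain Euler steps;
    both choices are illustrative, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    h_xyz = rng.standard_normal(h_rgb.shape)          # random noise epsilon
    for i in range(steps):
        t = 1.0 - i / steps                           # time runs from 1 toward 0
        x = np.concatenate([h_rgb, h_xyz], axis=1)    # assumed channel-wise concat
        h_xyz = h_xyz - denoiser(x, t) / steps        # one Euler update
    return h_xyz

def toy_denoiser(x, t):
    """Stand-in for the DiT: steers the latent toward 2 * h_rgb."""
    c = x.shape[1] // 2
    h_rgb, h_xyz = x[:, :c], x[:, c:]
    return (h_xyz - 2.0 * h_rgb) / t

def toy_decoder(h_xyz):
    """Stand-in for the pointmap decoder: keeps three channels as (x, y, z)."""
    return np.moveaxis(h_xyz[:, :3], 1, -1)           # (T, C, H, W) -> (T, H, W, 3)

h_rgb = np.random.default_rng(1).standard_normal((5, 4, 6, 8))
h_xyz = sample_pointmap_latent(h_rgb, toy_denoiser)
pointmaps = toy_decoder(h_xyz)
print(pointmaps.shape)
```

The toy denoiser's target depends only on the video latent, which makes the role of the conditioning visible: the sampled pointmap latent ends up determined by the video rather than by the initial noise.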