GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

Tian-Xing Xu, Xiangjun Gao, Wenbo Hu, Xiaoyu Li, Song-Hai Zhang, Ying Shan

TL;DR

GeometryCrafter tackles the challenge of obtaining metrically faithful, temporally coherent geometry from open-world videos. It introduces a point map VAE with a dual-encoder design that preserves a latent space aligned to diffusion priors, and a diffusion UNet conditioned on video latents and per-frame priors to generate high-quality point maps and depth. The approach achieves state-of-the-art 3D accuracy and temporal consistency across diverse datasets, enabling downstream tasks such as 3D/4D reconstruction, camera parameter estimation, and depth-conditioned video generation. A key trade-off is the method’s computational and memory overhead, which motivates future work on lightweight decoders.

Abstract

Despite remarkable advancements in video depth estimation, existing methods exhibit inherent limitations in achieving geometric fidelity through affine-invariant predictions, limiting their applicability in reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity point map sequences with temporal coherence from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach lies a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging the VAE, we train a video diffusion model that captures the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
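
To make the abstract's point about affine-invariant predictions concrete, the following is a minimal sketch, not taken from the paper, of why depth known only up to an unknown scale and shift is not enough for reconstruction. It unprojects three points on a slanted wall in a 2D (x, z) slice with a pinhole camera. The helper names (`unproject`, `triangle_area`, `collinearity_defect`), the focal length, and the wall equation are all assumptions chosen for illustration.

```python
# Illustrative sketch (assumed setup, not the paper's code):
# a scale error keeps a straight wall straight, a shift error bends it.

def unproject(u, depth, focal=1.0):
    """Pinhole unprojection in a 2D (x, z) slice: x = u * z / f."""
    return (u * depth / focal, depth)

def triangle_area(p, q, r):
    """Area of the triangle p, q, r; zero exactly when they are collinear."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])
    return 0.5 * abs(cross)

# Three points on a slanted wall z = 0.5 * x + 2 (hypothetical scene).
xs = [-1.0, 0.0, 1.0]
zs = [0.5 * x + 2.0 for x in xs]
# Image coordinates seen by a camera with focal length 1.
us = [x / z for x, z in zip(xs, zs)]

def collinearity_defect(depths):
    """Unproject the three pixels with the given depths and measure bending."""
    pts = [unproject(u, d) for u, d in zip(us, depths)]
    return triangle_area(*pts)

true_defect = collinearity_defect(zs)                        # ground truth
scaled_defect = collinearity_defect([3.0 * z for z in zs])   # unknown scale
shifted_defect = collinearity_defect([z + 1.0 for z in zs])  # unknown shift

print(true_defect, scaled_defect, shifted_defect)
```

Under these assumptions the scaled depths still unproject to a straight wall, while the shifted depths do not, which is one way to see why point maps are a more direct target for metrically grounded tasks than affine-invariant depth.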

Paper Structure

This paper contains 20 sections, 17 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Visual results on Sora-generated videos. From left to right: the input videos, the disparity maps, and the point cloud of the first frame.
  • Figure 2: Diffusion-based depth estimation methods, e.g., DepthCrafter [hu2024-DepthCrafter] and DAV [yang2024dav], suffer from significant metric errors in distant regions due to the compression of unbounded depth values into the bounded input range of VAEs.
  • Figure 2: Visual comparison with monocular geometry estimation methods. All point maps are converted to disparity maps to better visualize the sharpness of the depth predictions.
  • Figure 3: Architecture of our point map VAE. The point map VAE encodes and decodes point maps with unbounded values, alleviating the inaccurate prediction in distant regions. We adopt a dual-encoder design: the native encoder $\mathcal{E}_\text{SVD}$ inherited from SVD captures normalized disparity maps, while a residual encoder $\mathcal{E}_\epsilon$ embeds remaining information as an offset. It preserves the original latent space by regulating the latents via the original decoder $\mathcal{D}_\text{SVD}$, enabling the utilization of pretrained diffusion priors. A point map decoder $\mathcal{D}_\text{pmap}$ recovers the final point maps from the latent codes.
  • Figure 3: Visual results on DL3DV [ling2024dl3dv] with camera poses estimated from the output point maps. We concatenate 8 aligned point maps from the original point map sequence for visualization.
  • ...and 7 more figures
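
The first Figure 2 caption attributes errors in distant regions to squeezing unbounded depth into a bounded VAE input range. A minimal sketch of that effect follows, assuming min-max normalized disparity (inverse depth) as the bounded representation. The exact normalization used by DepthCrafter or DAV may differ, and the function names, the depth values, and the error size `eps` are hypothetical.

```python
# Illustrative sketch (assumed normalization, not the papers' exact pipeline):
# the same small error in a bounded disparity code costs far more at long range.

def normalize_disparity(depths):
    """Map depths to disparity 1/d, then min-max normalize into [0, 1]."""
    disp = [1.0 / d for d in depths]
    lo, hi = min(disp), max(disp)
    return [(x - lo) / (hi - lo) for x in disp], lo, hi

def denormalize_to_depth(norm_disp, lo, hi):
    """Invert the normalization and convert disparity back to depth."""
    return [1.0 / (x * (hi - lo) + lo) for x in norm_disp]

depths = [1.0, 10.0, 100.0]   # near, mid, far (arbitrary units)
norm, lo, hi = normalize_disparity(depths)

eps = 0.005                   # identical small error on every normalized value
noisy = [x + eps for x in norm]
recovered = denormalize_to_depth(noisy, lo, hi)

# Relative depth error grows sharply with distance.
rel_errors = [abs(r - d) / d for r, d in zip(recovered, depths)]
for d, e in zip(depths, rel_errors):
    print(f"depth {d:6.1f}: relative error {e:.3f}")
```

In this toy setting the far point absorbs a much larger relative error than the near point for the same perturbation, which is consistent with the motivation the first Figure 3 caption gives for a point map VAE that handles unbounded values directly.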