GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion

Gwanghyun Kim, Xueting Li, Ye Yuan, Koki Nagano, Tianye Li, Jan Kautz, Se Young Chun, Umar Iqbal

TL;DR

GeoMan tackles temporally consistent 3D human geometry estimation from monocular video under limited 4D data. It reframes video geometry estimation as an image-to-video diffusion task by decoupling an image-based first-frame geometry estimator (I2G) from a video-conditioned generator (V2G), leveraging diffusion priors learned from large-scale videos. A root-relative depth representation preserves human scale while enabling metric depth recovery, addressing both scale and temporal stability. Across depth and normal estimation, GeoMan achieves state-of-the-art performance and strong generalization to in-the-wild videos, outperforming baselines trained on much larger proprietary data.
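The decoupling described above can be read as a two-stage data flow: an image model predicts geometry for the first frame, and a video model propagates it across the sequence. The sketch below only illustrates that flow. All names in it (`estimate_video_geometry`, `i2g`, `v2g`, `Frame`, `Geometry`) are hypothetical placeholders rather than the authors' API, and the two models are passed in as plain callables.

```python
from typing import Callable, List, Sequence

# Hypothetical placeholder types: a frame and a geometry map (depth or normal)
# are both represented as nested lists of floats for illustration only.
Frame = List[List[float]]
Geometry = List[List[float]]


def estimate_video_geometry(
    frames: Sequence[Frame],
    i2g: Callable[[Frame], Geometry],
    v2g: Callable[[Sequence[Frame], Geometry], List[Geometry]],
) -> List[Geometry]:
    """Run a first-frame estimator, then a video model conditioned on its output."""
    if not frames:
        raise ValueError("at least one frame is required")
    # Stage 1 (assumed I2G role): geometry for the first frame only.
    first_geometry = i2g(frames[0])
    # Stage 2 (assumed V2G role): predictions for every frame, conditioned
    # on the video and on the first-frame geometry.
    predictions = v2g(frames, first_geometry)
    if len(predictions) != len(frames):
        raise ValueError("video model must return one prediction per frame")
    return predictions
```

Under this reading, switching between depth and normal estimation would only change what `i2g` returns for the first frame, while the same `v2g` callable is reused.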

Abstract

Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimates from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video, which then conditions a video diffusion model, reframing the video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model's role to focus on intricate details while using priors learned from large-scale video datasets. Consequently, GeoMan improves temporal consistency and generalizability while requiring minimal 4D training data. To address the challenge of accurate human size estimation, we introduce a root-relative depth representation that retains critical human-scale details and is easier to estimate from monocular inputs, overcoming the limitations of traditional affine-invariant and metric depth representations. GeoMan achieves state-of-the-art performance in both qualitative and quantitative evaluations, demonstrating its effectiveness in overcoming longstanding challenges in 3D human geometry estimation from videos.
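To make the contrast between depth representations concrete, here is a minimal sketch. It assumes that root-relative depth means metric depth minus the depth of a root joint (for example the pelvis); the exact definition and any normalization used in the paper are not given in this section, so the helper names and the formula are illustrative assumptions. Min-max normalization stands in for an affine-invariant representation.

```python
from typing import List, Optional

# A depth map as rows of per-pixel depths in meters; None marks background.
DepthMap = List[List[Optional[float]]]


def to_root_relative(metric: DepthMap, root_depth: float) -> DepthMap:
    """Assumed definition: offsets in meters from the root joint's depth."""
    return [[None if d is None else d - root_depth for d in row] for row in metric]


def to_metric(relative: DepthMap, root_depth: float) -> DepthMap:
    """Invert to_root_relative by adding the root depth back."""
    return [[None if d is None else d + root_depth for d in row] for row in relative]


def to_affine_invariant(metric: DepthMap) -> DepthMap:
    """Min-max normalize foreground depths to [0, 1]; scale and shift are lost."""
    values = [d for row in metric for d in row if d is not None]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero on flat maps
    return [[None if d is None else (d - lo) / span for d in row] for row in metric]


def extent(depth: DepthMap) -> float:
    """Range of foreground depth values (a crude proxy for body thickness)."""
    values = [d for row in depth for d in row if d is not None]
    return max(values) - min(values)


# Toy data: the same subject at 3 m and at 8 m, plus a thinner subject at 3 m.
near = [[None, 2.75], [3.0, 3.25]]
far = [[None, 7.75], [8.0, 8.25]]
thin = [[None, 2.875], [3.0, 3.125]]
```

In this toy setup, `near` and `far` map to the same root-relative values, so the representation is independent of how far the subject stands from the camera, while `thin` keeps its smaller extent in meters. The min-max version maps all three to the range [0, 1] and can no longer distinguish their sizes. Metric depth is recovered from the root-relative map by adding back a single scalar, the root depth.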

Paper Structure

This paper contains 27 sections, 10 equations, 18 figures, 9 tables.

Figures (18)

  • Figure 1: GeoMan provides accurate and temporally stable geometric predictions for human videos, surpassing existing methods.
  • Figure 2: Overview of GeoMan: (a) Given a video sequence $\mathbf{X}^{(1:F)}$ as input, we first use I2G to estimate the normal or depth of the first frame $\mathbf{X}^{(1)}$. This initial prediction is then used to condition the V2G model, which generates predictions for the entire input sequence. GeoMan seamlessly handles both depth and normal estimation tasks using the same model weights, requiring only a replacement of the input condition for the first frame. (b) We propose a human-centered root-relative depth representation, which retains the human scale information and enables better temporal modeling.
  • Figure 3: Comparison of depth representations: Our representation offers the highest fidelity and improves temporal modeling.
  • Figure 4: Zero-shot normal estimation comparison on ActorsHQ. Left: Predicted normal for the first frame. Middle: Predicted normal for the second frame. Right: Angular error visualization for the second frame.
  • Figure 5: Comparison with existing depth estimation models. GeoMan achieves state-of-the-art performance in both depth prediction (top row in each result) and point cloud reconstruction (bottom row), excelling in temporal consistency, fidelity, and scale preservation.
  • ...and 13 more figures
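
The angular error visualized in Figure 4 is commonly computed as the angle between the predicted and ground-truth normal vectors. The sketch below assumes that standard definition; the paper's exact evaluation protocol (masking, averaging, thresholds) is not described in this section, and the function names are illustrative.

```python
import math
from typing import Iterable, Sequence, Tuple

Vec3 = Sequence[float]


def angular_error_deg(pred: Vec3, gt: Vec3) -> float:
    """Angle in degrees between two 3D vectors, independent of their lengths."""
    dot = sum(p * g for p, g in zip(pred, gt))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(g * g for g in gt))
    if norm == 0.0:
        raise ValueError("normals must be non-zero vectors")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def mean_angular_error_deg(pairs: Iterable[Tuple[Vec3, Vec3]]) -> float:
    """Average per-pixel angular error over (predicted, ground-truth) pairs."""
    errors = [angular_error_deg(p, g) for p, g in pairs]
    if not errors:
        raise ValueError("at least one pair is required")
    return sum(errors) / len(errors)
```

Because both vectors are normalized inside the function, unnormalized network outputs can be passed directly.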