Synergistic Global-space Camera and Human Reconstruction from Videos

Yizhou Zhao; Tuanfeng Y. Wang; Bhiksha Raj; Min Xu; Jimei Yang; Chun-Hao Paul Huang

TL;DR

SynCHMR addresses the challenge of jointly reconstructing metric-scale camera motion, dense scene geometry, and 3D human meshes from monocular videos by tightly integrating HMR with SLAM. It introduces a two-phase approach: (i) Human-aware Metric SLAM, which leverages camera-frame HMR as a strong prior to resolve depth, scale, and dynamic ambiguities, and (ii) Scene-aware SMPL Denoising, which denoises world-frame humans conditioned on dense scene cues and enforces spatiotemporal coherence. The key contributions are (a) a metric-scale SLAM procedure guided by human priors to achieve accurate camera trajectories and scene geometry, and (b) a scene-conditioned denoiser that refines SMPL parameters using dynamic scene information, enabling coherent global reconstructions without requiring pre-scanned scenes. The approach yields state-of-the-art or competitive results on real-world benchmarks, producing unified reconstructions of humans, camera motion, and dense scenes in a common world frame, with practical applications in animation, AR/VR, and visual effects.

Abstract

Remarkable strides have been made in reconstructing static scenes or human bodies from monocular videos. Yet, the two problems have largely been approached independently, without much synergy. Most visual SLAM methods can only reconstruct camera trajectories and scene structures up to scale, while most HMR methods reconstruct human meshes in metric scale but fall short in reasoning with cameras and scenes. This work introduces Synergistic Camera and Human Reconstruction (SynCHMR) to marry the best of both worlds. Specifically, we design Human-aware Metric SLAM to reconstruct metric-scale camera poses and scene point clouds using camera-frame HMR as a strong prior, addressing depth, scale, and dynamic ambiguities. Conditioning on the dense scene recovered, we further learn a Scene-aware SMPL Denoiser to enhance world-frame HMR by incorporating spatio-temporal coherency and dynamic scene constraints. Together, they lead to consistent reconstructions of camera trajectories, human meshes, and dense scene point clouds in a common world frame. Project page: https://paulchhuang.github.io/synchmr
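
As a rough illustration of how a metric-scale human mesh can anchor monocular depth, the sketch below fits one scale and one shift that align predicted depths at human pixels with depths rendered from the camera-frame mesh, then applies them to other pixels. It is a minimal sketch under assumed inputs, not the paper's calibration procedure: the affine model, the function names fit_scale_shift and calibrate, and all sample values are hypothetical.

```python
def fit_scale_shift(pred_depth, mesh_depth):
    """Least-squares fit of mesh_depth ~= s * pred_depth + b over paired samples.

    pred_depth: predicted (possibly mis-scaled) depths at human pixels.
    mesh_depth: metric depths rendered from the human mesh at the same pixels.
    """
    n = len(pred_depth)
    if n != len(mesh_depth) or n < 2:
        raise ValueError("need at least two paired depth samples")
    mean_p = sum(pred_depth) / n
    mean_m = sum(mesh_depth) / n
    var_p = sum((p - mean_p) ** 2 for p in pred_depth)
    if var_p == 0.0:
        raise ValueError("predicted depths are constant; scale is unobservable")
    cov = sum((p - mean_p) * (m - mean_m)
              for p, m in zip(pred_depth, mesh_depth))
    s = cov / var_p           # closed-form slope of the 1D regression
    b = mean_m - s * mean_p   # closed-form intercept
    return s, b


def calibrate(depth_values, s, b):
    """Apply the fitted scale and shift to arbitrary depth values."""
    return [s * d + b for d in depth_values]


# Hypothetical samples: mesh depth is about twice the predicted depth plus 0.1 m.
pred_on_human = [1.0, 1.5, 2.0, 2.5]
mesh_on_human = [2.1, 3.1, 4.1, 5.1]
s, b = fit_scale_shift(pred_on_human, mesh_on_human)

# Background pixels inherit the same correction.
background = calibrate([4.0, 6.0], s, b)
```

A single global scale and shift is the simplest possible model; it is used here only to show why a human of known metric size makes the scale observable.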

Paper Structure (24 sections, 14 equations, 6 figures, 6 tables)

Introduction
Related Work
Method
Preliminaries
SLAM
HMR
Human-aware Metric SLAM
Preprocessing
Calibrating Depth with Human Prior
Disambiguating SLAM with Calibrated Depth
Scene-aware SMPL Denoising
Initializing Humans with Metric Cameras
Constraining Humans with Dynamic Scenes
Experiments
Experimental Setting
...and 9 more sections

Figures (6)

Figure 1: Illustration of three types of ambiguities in visual SLAM. We show SLAM reconstruction results from DROID-SLAM \cite{teed2021droid}. (a) Depth ambiguity occurs when there are only minor camera translations between different views. This can lead to geometric failures in reconstruction, such as the folded-back corridor in the side view. (b) Scale ambiguity is inherent in monocular SLAM systems and requires an additional reference for disambiguation. (c) Dynamic ambiguity becomes pronounced when moving foregrounds dominate frames. Over-reliance on foreground key points results in incorrect camera trajectories.
Figure 2: The architecture of SynCHMR. Our pipeline comprises two phases. The first phase, Human-aware Metric SLAM (\ref{subsec:human-aware-slam}), infers metric-scale camera poses and metric-scale point clouds by exploiting the camera-frame human prior. The second phase, Scene-aware SMPL Denoising (\ref{subsec:smpl_denoising}), involves the conditional denoising of world-frame noisy SMPL parameters. These parameters, initialized by transforming from the camera frame, get refined through conditioning on the dynamic point clouds obtained in the first phase. The whole pipeline thus reconstructs humans, scene point clouds, and cameras harmoniously in a common world frame.
Figure 3: The architecture of Scene-aware SMPL Denoiser. World-frame noisy SMPL parameters $\{\boldsymbol \Phi _{nt}^\text{w}, \boldsymbol \theta _{nt}, \boldsymbol \beta _{nt}, \boldsymbol \Gamma _{nt}^\text{w}\}_0$ are first projected by a linear layer and summed with temporal positional embeddings (TPE) to get initial latent humans $\{\mathbf{z}_{nt,0}^\text{SMPL}\}$. Per-frame point clouds are aggregated to $\mathbf{x}^\text{scene}$ and encoded with the point encoder $\mathcal{E}$. Then we query the encoded scene $\mathcal{E}(\mathbf{x}^\text{scene})$ with latent humans $\{\mathbf{z}_{nt,0}^\text{SMPL}\}$ in the scene-conditioned denoiser $\mathcal{D}$ and feed the result $\{\mathbf{z}_{nt,1}^\text{SMPL}\}$ to prediction heads $\{\mathcal{P}_{\boldsymbol \Phi },\mathcal{P}_{\boldsymbol \theta },\mathcal{P}_{\boldsymbol \beta },\mathcal{P}_{\boldsymbol \Gamma }\}$ to obtain denoised SMPL parameters $\{\boldsymbol \Phi _{nt}^\text{w}, \boldsymbol \theta _{nt}, \boldsymbol \beta _{nt}, \boldsymbol \Gamma _{nt}^\text{w}\}_1$.
Figure 4: Qualitative comparison among world-frame HMR approaches. We show (b) GLAMR \cite{yuan2021glamr} and (c) TRACE \cite{sun2023trace} results with their pre-defined ground planes, (d) SLAHMR \cite{ye2023slahmr} outputs with its estimated ground plane, and (e) our SynCHMR outputs with dense scenes. In the first row, we also demonstrate top-view human trajectories within circles. See supplementary for video results.
Figure 5: Qualitative comparisons on the parkour sequence from DAVIS \cite{Perazzi2016}. (a) Point cloud reconstructed by naive DROID-SLAM \cite{teed2021droid} with RGB input; (b) point cloud reconstructed by DROID-SLAM with RGB input, where the foreground humans are masked out by the instance segmentation method Mask2Former \cite{cheng2022masked}; (c) point cloud reconstructed by DROID-SLAM with RGB-D input, where the depth channel comes from ZoeDepth \cite{bhat2023zoedepth} estimates (likewise below); (d) point cloud reconstructed by DROID-SLAM with RGB-D and instance segmentation mask inputs; (e) point cloud reconstructed by our proposed Human-aware Metric SLAM. Please see the webpage for video results.
...and 1 more figure
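
The initialization step mentioned in the Figure 2 caption, transforming camera-frame SMPL parameters into the world frame, amounts to a rigid transform of the root orientation and translation once a metric camera pose is available. The following is a minimal sketch, assuming a camera-to-world rotation R_wc and translation t_wc (hypothetical names) and ignoring the SMPL pelvis offset that a full implementation would need to handle; it is not taken from the paper's code.

```python
def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec3(a, v):
    """Product of a 3x3 matrix and a length-3 vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def camera_to_world(R_wc, t_wc, root_rot_c, root_trans_c):
    """Map a camera-frame root pose to the world frame.

    R_wc, t_wc: assumed camera-to-world rotation (3x3) and translation (3,).
    root_rot_c, root_trans_c: root orientation (3x3) and translation (3,)
    expressed in the camera frame.
    """
    root_rot_w = matmul3(R_wc, root_rot_c)
    rotated = matvec3(R_wc, root_trans_c)
    root_trans_w = [r + t for r, t in zip(rotated, t_wc)]
    return root_rot_w, root_trans_w


# Hypothetical pose: camera rotated 90 degrees about the y axis and placed
# 1 m along the world x axis; the person stands 3 m in front of the camera.
R_wc = [[0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0],
        [-1.0, 0.0, 0.0]]
t_wc = [1.0, 0.0, 0.0]
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

rot_w, trans_w = camera_to_world(R_wc, t_wc, identity, [0.0, 0.0, 3.0])
```

Because the camera pose here is assumed to be metric, the resulting world-frame translation is in the same units as the human mesh; with an up-to-scale camera trajectory the two would not be directly composable.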
