World-Grounded Human Motion Recovery via Gravity-View Coordinates

Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, Xiaowei Zhou

TL;DR

A method for recovering world-grounded human motion from monocular video. It produces more realistic motion in both camera-space and world-grounded settings and outperforms state-of-the-art methods in both accuracy and speed.

Abstract

We present a novel method for recovering world-grounded human motion from monocular video. The main challenge lies in the ambiguity of defining the world coordinate system, which varies between sequences. Previous approaches attempt to alleviate this issue by predicting relative motion in an autoregressive manner, but they are prone to accumulating errors. Instead, we propose estimating human poses in a novel Gravity-View (GV) coordinate system, which is defined by the world gravity and the camera view direction. The proposed GV system is naturally gravity-aligned and uniquely defined for each video frame, which greatly reduces the ambiguity of learning the image-to-pose mapping. The estimated poses can be transformed back to the world coordinate system using camera rotations, forming a global motion sequence. Additionally, the per-frame estimation avoids the error accumulation of autoregressive methods. Experiments on in-the-wild benchmarks demonstrate that our method recovers more realistic motion in both the camera-space and world-grounded settings, outperforming state-of-the-art methods in both accuracy and speed. The code is available at https://zju3dv.github.io/gvhmr/.

Paper Structure

This paper contains 25 sections, 5 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Comparison of coordinate systems. In camera coordinates, a person may appear inclined due to the camera's roll and pitch movement. In contrast, in GV coordinates, the person is naturally aligned with gravity.
  • Figure 2: Overview of the proposed framework. Given a monocular video (left), GVHMR follows WHAM in preprocessing the video: it tracks the human bounding box, detects 2D keypoints, extracts image features, and estimates the relative camera rotation using visual odometry or a gyroscope. GVHMR then fuses these features into per-frame tokens, which are processed with a relative transformer and multitask MLPs. The outputs include: (1) intermediate representations (middle), i.e., human orientation in the Gravity-View coordinate system, root velocity in the SMPL coordinate system, and the stationary probability for predefined joints; and (2) camera-frame SMPL parameters (right-top). Finally, the global trajectory (right-bottom) is recovered by transforming the intermediate representations to the world coordinate system, as described in Sec. 3.1.
  • Figure 3: Gravity-View (GV) coordinate system, defined by the gravity direction and the camera view direction. (Refer to Sec. 3.1 for details).
  • Figure 4: Relative rotation between two GV coordinate systems. (a) shows two adjacent GV coordinate systems and the camera view directions. (b) illustrates the relative rotation between two GV systems. $R_{\Delta GV}$ occurs exclusively around the y-axis (gravity direction).
  • Figure 5: Network architecture. The input features are fused into per-frame tokens by the early-fusion module, processed by the relative transformer, and then output by multitask MLPs as intermediate representations. The weak-camera parameter $c_w$ is restored to the camera frame $\tau_c$ following CLIFF. The predicted $\Gamma_{GV}$ and $v_{root}$ are converted to the world frame $\Gamma_w$ and $\tau_w$, as described in Sec. 3.1. Finally, we use joint stationary probabilities $p_s$ to post-process the global motion.
  • ...and 4 more figures
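
The captions of Figures 3 and 4 describe the GV coordinate system only in words, so a small sketch may help make the construction concrete. The code below is an illustration under stated assumptions, not the authors' implementation: the captions say only that the frame is defined by the gravity direction and the camera view direction and that the y-axis is the gravity direction. The choice x = y × view and z = x × y, along with the helper names `gv_axes`, `to_gv`, and `from_gv`, are hypothetical conventions picked here to obtain a right-handed frame.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n < 1e-8:
        raise ValueError("cannot normalize a near-zero vector")
    return tuple(c / n for c in v)


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(p * q for p, q in zip(a, b))


def gv_axes(gravity, view):
    """Return the GV axes (x, y, z), expressed in the frame of the inputs.

    y is the gravity direction (as in the Figure 4 caption).
    x and z follow an assumed convention: x = y × view, z = x × y.
    """
    y = normalize(gravity)
    x = normalize(cross(y, view))  # raises if view is parallel to gravity
    z = cross(x, y)
    return x, y, z


def to_gv(axes, v):
    """Express a vector, given in the input frame, in GV coordinates."""
    return tuple(dot(a, v) for a in axes)


def from_gv(axes, c):
    """Map GV coordinates back to the input frame."""
    x, y, z = axes
    return tuple(c[0] * x[i] + c[1] * y[i] + c[2] * z[i] for i in range(3))


# Two frames of a hypothetical video: same gravity, different view directions.
gravity = (0.0, 0.0, -1.0)
view_a = (1.0, 0.0, -0.3)
view_b = (0.5, 1.0, 0.2)
axes_a = gv_axes(gravity, view_a)
axes_b = gv_axes(gravity, view_b)

# The relative rotation from GV frame b to GV frame a leaves the y-axis fixed.
y_b_in_a = to_gv(axes_a, from_gv(axes_b, (0.0, 1.0, 0.0)))
# The view direction has no x-component in its own GV frame.
view_a_in_gv = to_gv(axes_a, view_a)
```

Under this convention the view direction always lies in the GV y-z plane, so camera roll and pitch do not tilt the frame, and the relative rotation between two GV frames that share the same gravity keeps the y-axis fixed, which matches the statement in the Figure 4 caption that $R_{\Delta GV}$ is a rotation about the y-axis.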