Motion Diffusion-Guided 3D Global HMR from a Dynamic Camera

Jaewoo Heo, Kuan-Chieh Wang, Karen Liu, Serena Yeung-Levy

TL;DR

The key insight is that recent human motion generation models, such as the motion diffusion model (MDM), contain a strong prior of coherent human motion; optimizing an initial reconstruction against this prior can lead to more globally coherent human motion.

Abstract

Motion capture technologies have transformed numerous fields, from the film and gaming industries to sports science and healthcare, by providing a tool to capture and analyze human movement in great detail. The holy grail of monocular global human mesh and motion reconstruction (GHMR) is to achieve accuracy on par with traditional multi-view capture on any in-the-wild monocular video captured with a dynamic camera. This is a challenging task: the monocular input has inherent depth ambiguity, and the moving camera adds complexity because the observed human motion is now a product of both human and camera movement. Because they do not account for this entanglement, existing GHMR methods often output unrealistic motions; for example, unaccounted-for root translation of the human causes foot sliding. We present DiffOpt, a novel 3D global HMR method using Diffusion Optimization. Our key insight is that recent human motion generation models, such as the motion diffusion model (MDM), contain a strong prior of coherent human motion. The core of our method is to optimize the initial motion reconstruction using the MDM prior, which can lead to more globally coherent human motion. Our optimization jointly minimizes the motion prior loss and the reprojection loss to correctly disentangle the human and camera motions. We validate DiffOpt on video sequences from the Electromagnetic Database of Global 3D Human Pose and Shape in the Wild (EMDB) and Egobody, and demonstrate superior global human motion recovery over other state-of-the-art global HMR methods, most prominently in long video settings.
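
To make the human/camera entanglement described in the abstract concrete, the following is a minimal sketch of a confidence-weighted 2D reprojection loss under a standard pinhole camera model. It is an illustration under stated assumptions, not the paper's implementation: the function names `project_points` and `reprojection_loss`, the intrinsics `K`, and the toy joint positions are all hypothetical. The demo shows that moving the human in the world and moving the camera by the same offset give identical image evidence, which is why a reprojection term alone cannot separate the two.

```python
import numpy as np

def project_points(joints_world, R_cam, t_cam, K):
    """Project world-frame 3D joints (J, 3) to pixel coordinates (J, 2).

    R_cam (3, 3) and t_cam (3,) map world coordinates to camera coordinates;
    K (3, 3) is a pinhole intrinsics matrix. All names are illustrative.
    """
    joints_cam = joints_world @ R_cam.T + t_cam   # world -> camera frame
    uvw = joints_cam @ K.T                        # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]               # perspective divide

def reprojection_loss(joints_world, R_cam, t_cam, K, keypoints_2d, conf):
    """Confidence-weighted mean squared pixel error against detected keypoints."""
    pred = project_points(joints_world, R_cam, t_cam, K)
    sq_err = np.sum((pred - keypoints_2d) ** 2, axis=-1)
    return float(np.sum(conf * sq_err) / np.sum(conf))

# Toy setup (assumed values): two joints about 3 m in front of the camera.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
joints = np.array([[0.0, 0.0, 3.0],
                   [0.2, -0.5, 3.1]])
offset = np.array([0.5, 0.0, 0.0])

# Case A: the human moves by `offset` in the world, the camera stays fixed.
pixels_human_moves = project_points(joints + offset, np.eye(3), np.zeros(3), K)
# Case B: the human stays fixed, the camera extrinsic translation absorbs `offset`.
pixels_camera_moves = project_points(joints, np.eye(3), offset, K)

# Both explanations produce the same 2D observations.
ambiguous = np.allclose(pixels_human_moves, pixels_camera_moves)
loss_at_truth = reprojection_loss(joints, np.eye(3), offset, K,
                                  pixels_camera_moves, np.ones(2))
```

Because both cases yield the same pixels, an additional term such as a motion prior is needed to prefer the physically plausible explanation.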

Paper Structure

This paper contains 25 sections, 7 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: (top) DiffOpt system architecture. Given an input video with $T(n)$ frames, DiffOpt uses neural motion fields to predict the pose, root orientation, and global root translation for each frame. We regress these parameters using the SMPL body model [loper2015smpl] to get the 3D joint and vertex positions. Our predicted motion is then constrained by a 3D loss against initial predictions from off-the-shelf HMR models [goel2023humans], a 2D re-projection loss against predictions from 2D keypoint detection models [xu2022vitpose], and a motion prior loss from the motion diffusion model [tevet2022human]. (bottom) The MDM-SDS loss [poole2022dreamfusion] is computed by transforming the neural motion fields' predicted parameters to MDM's input format, running the noising and de-noising steps to compute the posterior, and using this to compute the SDS guidance [poole2022dreamfusion]. This guidance term is back-propagated to the neural motion fields.
  • Figure 2: Qualitative results on a trimmed segment of the 'soccer warmup' EMDB sequence [kaufmann2023emdb]. This is a challenging motion sequence, as the human subject continuously twists his hips while making quick side-steps. 3D human meshes are rendered on the original video frames for GLAMR [yuan2022glamr] in the top row, SLAHMR [ye2023decoupling] in the middle row, and DiffOpt in the bottom row. The ground-truth global root trajectory and each model's predicted global root trajectory are visualized next to the original video renderings.
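
The noising and de-noising steps in the Figure 1 caption can be sketched as a score distillation sampling (SDS) style guidance term. This is a rough illustration only, under several assumptions: `toy_denoiser` is a hypothetical stand-in for a pretrained motion diffusion model (it simply rescales and temporally smooths its input), the noise level `alpha_bar`, the weight `1 - alpha_bar`, the step size, and the jittery test motion are arbitrary choices, and the guidance is applied directly to the motion array rather than back-propagated into neural motion fields as the caption describes.

```python
import numpy as np

def toy_denoiser(x_t, alpha_bar):
    """Hypothetical stand-in for a pretrained motion diffusion model.

    Returns an estimate of the clean motion (T, D): undo the signal scaling,
    then apply a 3-frame moving average over time as a crude smoothness prior.
    """
    x0_est = x_t / np.sqrt(alpha_bar)
    padded = np.pad(x0_est, ((1, 1), (0, 0)), mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def sds_guidance(x0, alpha_bar, rng, denoiser=toy_denoiser):
    """SDS-style guidance: weight * (predicted noise - injected noise)."""
    eps = rng.standard_normal(x0.shape)
    # Forward (noising) step of the diffusion process.
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    # De-noising step: the model predicts the clean motion ...
    x0_hat = denoiser(x_t, alpha_bar)
    # ... which is converted to the equivalent noise prediction.
    eps_hat = (x_t - np.sqrt(alpha_bar) * x0_hat) / np.sqrt(1.0 - alpha_bar)
    weight = 1.0 - alpha_bar  # assumed weighting, not taken from the paper
    return weight * (eps_hat - eps)

def roughness(x):
    """Mean absolute second difference over time, a simple jitter measure."""
    return float(np.mean(np.abs(x[2:] - 2.0 * x[1:-1] + x[:-2])))

rng = np.random.default_rng(0)
frames = np.arange(40, dtype=float)[:, None]
jitter = np.where(np.arange(40) % 2 == 0, 0.5, -0.5)[:, None]
motion = 0.1 * frames + jitter          # steady translation plus frame-to-frame jitter

refined = motion.copy()
for _ in range(100):
    # Gradient step on the motion itself, using the guidance as the gradient.
    refined -= 0.5 * sds_guidance(refined, alpha_bar=0.9, rng=rng)

rough_before = roughness(motion)
rough_after = roughness(refined)
```

In this toy setting the guidance pulls the trajectory toward what the stand-in prior considers plausible, so the frame-to-frame jitter shrinks over the iterations; with a real motion diffusion model the same mechanism would favor coherent human motion instead of mere smoothness.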