FrankMocap: Fast Monocular 3D Hand and Body Motion Capture by Regression and Integration

Yu Rong, Takaaki Shiratori, Hanbyul Joo

TL;DR

FrankMocap captures 3D hand and body motion simultaneously from a single camera by splitting the problem into separate hand and body regression modules, each of which outputs SMPL-X parameters, followed by an integration step that merges them into a single whole-body representation. With the fast copy-and-paste integration, the system runs in near real time (~9.5 fps); an optional optimization-based refinement uses 2D keypoints and exemplar priors to improve accuracy. It reports state-of-the-art hand pose accuracy on public benchmarks and strong whole-body results in diverse in-the-wild scenes, including live demos. Ablation studies indicate that multi-dataset training and motion blur augmentation both improve in-the-wild generalization.

Abstract

Although the essential nuance of human motion is often conveyed as a combination of body movements and hand gestures, existing monocular motion capture approaches mostly focus either on body motion capture, ignoring the hands, or on hand motion capture, without considering body motion. In this paper, we present FrankMocap, a motion capture system that estimates both 3D hand and body motion from in-the-wild monocular inputs with faster speed and better accuracy than previous work. Our method works in near real time (9.5 fps) and produces 3D body and hand motion capture outputs in a unified parametric model structure, capturing both simultaneously from challenging in-the-wild monocular videos. To construct FrankMocap, we build a state-of-the-art monocular 3D "hand" motion capture method by taking the hand part of the whole-body parametric model (SMPL-X). Our 3D hand motion capture output can be efficiently integrated with the monocular body motion capture output, producing whole-body motion results in a unified parametric model structure. We demonstrate the state-of-the-art performance of our hand motion capture system on public benchmarks, and show the high quality of our whole-body motion capture results in various challenging real-world scenes, including a live demo scenario.
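
The integration of hand and body outputs into one parametric model can be illustrated with a small sketch of a copy-and-paste style merge. This is not the authors' implementation. It assumes that joint rotations are stored as 3x3 matrices, that the hand module predicts a global wrist orientation, and that the wrist's local rotation is obtained by re-expressing that orientation relative to the wrist's parent joint. The function names (`rodrigues`, `copy_paste_hand`), the joint count, and the wrist index are hypothetical.

```python
import numpy as np

def rodrigues(axis_angle):
    """Convert an axis-angle vector of shape (3,) to a 3x3 rotation matrix."""
    angle = np.linalg.norm(axis_angle)
    if angle < 1e-8:
        return np.eye(3)
    k = axis_angle / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def copy_paste_hand(body_rots, parent_global, hand_global, finger_rots, wrist_idx):
    """Merge one hand prediction into a body pose (illustrative layout only).

    body_rots:     (J, 3, 3) local joint rotations from the body module
    parent_global: (3, 3) global rotation of the wrist's parent joint,
                   assumed to come from forward kinematics on the body pose
    hand_global:   (3, 3) global wrist orientation from the hand module
    finger_rots:   (F, 3, 3) local finger rotations from the hand module
    wrist_idx:     index of the wrist joint in body_rots (assumed)
    """
    merged = body_rots.copy()
    # Express the hand's global orientation in the parent joint's frame,
    # so that parent_global @ merged[wrist_idx] equals hand_global.
    merged[wrist_idx] = parent_global.T @ hand_global
    # Finger rotations are local already, so they are copied unchanged.
    return merged, finger_rots.copy()

# Usage with made-up values: 22 body joints, 15 finger joints.
body = np.tile(np.eye(3), (22, 1, 1))
fingers = np.tile(np.eye(3), (15, 1, 1))
elbow_global = rodrigues(np.array([0.0, 0.0, np.pi / 2]))
wrist_global = rodrigues(np.array([0.3, 0.0, 0.0]))
merged_body, merged_fingers = copy_paste_hand(
    body, elbow_global, wrist_global, fingers, wrist_idx=21)
```

Because only rotations are copied and one matrix product is computed per hand, a merge of this kind adds negligible cost on top of the two regression passes, which is consistent with the near-real-time figure quoted above.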

Paper Structure

This paper contains 13 sections, 13 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of our pipeline for whole body motion capture. Given a single RGB image input, we apply our hand module and body module to estimate 3D hands and 3D body. Our integration module combines these outputs into a unified whole body output.
  • Figure 2: Overall framework of our hand module. Our hand module takes a cropped hand image $\mathbf{I}_H$ as input, and produces the parameters of the hand model, $[\boldsymbol{\phi}_h, \boldsymbol{\theta}_h, \boldsymbol{\beta}_h, \boldsymbol{c}_h]$. Our hand module is built as a deep encoder-decoder network. The predicted hand parameters are used to produce the mesh shape and pose of the hand part of SMPL-X.
  • Figure 3: Our hand model taken from SMPL-X. We take the hand part of SMPL-X as a stand-alone hand model for hand pose estimation. The example mesh is shown in (a) and the skeleton hierarchy is shown in (b).
  • Figure 4: Motion Blur Augmentation. We show example images of motion blur augmentation. From left to right: original images, augmented images after applying different motion blur kernels.
  • Figure 5: Optimizing the whole body model (SMPL-X) with 3D hand prediction and 2D keypoint estimation. (a) An input image and the 2D keypoints estimated by OpenPose (Cao et al., 2019); (b) 3D body pose estimation from our body module; (c) The output of the 3D hand module aligned to the wrist joints of SMPL-X; (d) Integration output by the copy-and-paste strategy; (e) Integration output by our optimization framework.
  • ...and 4 more figures
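
Figure 4 describes motion blur augmentation as applying different motion blur kernels to training images. A minimal sketch of that idea is given here, assuming straight-line kernels of a chosen size and angle applied with edge padding. The summary above does not state which kernel shapes, sizes, or angles the authors use, so those choices and the function names (`motion_blur_kernel`, `apply_motion_blur`) are illustrative assumptions.

```python
import numpy as np

def motion_blur_kernel(size, angle_deg):
    """Build a normalized line kernel of shape (size, size); size should be odd."""
    kernel = np.zeros((size, size), dtype=np.float64)
    c = (size - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    # Rasterize a straight line through the kernel centre.
    for t in np.linspace(-c, c, 2 * size):
        x = int(round(c + t * np.cos(theta)))
        y = int(round(c - t * np.sin(theta)))
        kernel[y, x] = 1.0
    return kernel / kernel.sum()

def apply_motion_blur(image, kernel):
    """Filter an (H, W) or (H, W, C) image with the kernel, using edge padding."""
    k = kernel.shape[0]
    pad = k // 2
    img = image.astype(np.float64)
    is_gray = img.ndim == 2
    if is_gray:
        img = img[..., None]
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    # Accumulate shifted copies of the image weighted by the kernel entries.
    for dy in range(k):
        for dx in range(k):
            if kernel[dy, dx] != 0.0:
                out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out[..., 0] if is_gray else out

# Usage: blur a random stand-in for a hand crop with a randomly chosen kernel.
rng = np.random.default_rng(0)
crop = rng.random((64, 64, 3))
kernel = motion_blur_kernel(size=9, angle_deg=rng.uniform(0.0, 180.0))
blurred = apply_motion_blur(crop, kernel)
```

Sampling the angle and kernel size at random per training image is one simple way to expose a network to the streaking seen on fast-moving hands in video frames.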