Look Ma, no markers: holistic performance capture without the hassle

Charlie Hewitt, Fatemeh Saleh, Sadegh Aliakbarian, Lohit Petikam, Shideh Rezaeifar, Louis Florentin, Zafiirah Hosenie, Thomas J Cashman, Julien Valentin, Darren Cosker, Tadas Baltrusaitis

TL;DR

This work introduces the first technique for marker-free, high-quality reconstruction of the complete human body, including eyes and tongue, without requiring any calibration, manual intervention or custom hardware. It produces stable world-space results from arbitrary camera rigs and supports varied capture environments and clothing.

Abstract

We tackle the problem of highly accurate, holistic performance capture for the face, body and hands simultaneously. Motion-capture technologies used in film and game production typically focus only on face, body or hand capture independently, and involve complex and expensive hardware and a high degree of manual intervention from skilled operators. While machine-learning-based approaches exist to overcome these problems, they usually only support a single camera, often operate on a single part of the body, do not produce precise world-space results, and rarely generalize outside specific contexts. In this work, we introduce the first technique for marker-free, high-quality reconstruction of the complete human body, including eyes and tongue, without requiring any calibration, manual intervention or custom hardware. Our approach produces stable world-space results from arbitrary camera rigs and supports varied capture environments and clothing. We achieve this through a hybrid approach that leverages machine learning models trained exclusively on synthetic data and powerful parametric models of human shape and motion. We evaluate our method on a number of body, face and hand reconstruction benchmarks and demonstrate state-of-the-art results that generalize to diverse datasets.

Paper Structure

This paper contains 54 sections, 21 equations, 21 figures, 10 tables.

Figures (21)

  • Figure 1: Construction of synthetic image by sampling random identity and pose for the SOMA parametric model, clothing, hair and accessory assets, textures, shaders and HDRI environments from our asset library (top row). From this we can produce a highly realistic image and a number of corresponding ground-truth annotations (bottom row).
  • Figure 2: Example images from the SynthBody, SynthFace and SynthHand datasets. All of our neural networks are trained exclusively on these synthetic datasets.
  • Figure 3: Twelve tongue blendshapes included in the SOMA model, visualized with mouth opening blendshapes also activated. These blendshapes allow for significant coverage of the range-of-motion of the tongue.
  • Figure 4: Body and face shape samples. We sample at random from male, female and ungendered GMMs. The SOMA model itself has no explicit concept of gender.
  • Figure 5: Pose and expression library samples. If not all data is present in the original capture, we splice together body, hand and face pose and expression from separate sequences.
  • ...and 16 more figures