3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow

Felix Taubner, Prashant Raina, Mathieu Tuli, Eu Wern Teh, Chul Lee, Jinmiao Huang

TL;DR

This paper proposes a novel face tracker that introduces an innovative 2D alignment network for dense per-vertex alignment and is trained on high-quality 3D scan annotations rather than weak supervision or synthetic data.

Abstract

When working with 3D facial data, improving fidelity and avoiding the uncanny valley effect depend critically on accurate 3D facial performance capture. Because such capture methods are expensive and 2D videos are widely available, recent methods have focused on monocular 3D face tracking. However, these methods often fall short in capturing precise facial movements due to limitations in their network architecture, training, and evaluation processes. Addressing these challenges, we propose a novel face tracker, FlowFace, that introduces an innovative 2D alignment network for dense per-vertex alignment. Unlike prior work, FlowFace is trained on high-quality 3D scan annotations rather than weak supervision or synthetic data. Our 3D model fitting module jointly fits a 3D face model from one or many observations, integrating existing neutral shape priors for enhanced identity and expression disentanglement and per-vertex deformations for detailed facial feature reconstruction. Additionally, we propose a novel metric and benchmark for assessing tracking accuracy. Our method exhibits superior performance on both custom and publicly available benchmarks. We further validate the effectiveness of our tracker by generating high-quality 3D data from 2D videos, which leads to performance gains on downstream tasks.

Paper Structure (56 sections, 17 equations, 25 figures, 7 tables)

This paper contains 56 sections, 17 equations, 25 figures, 7 tables.

Figures (25)

  • Figure 1: An overview of the proposed 2D alignment network architecture. A feature encoder transforms the image into a latent feature map that is then iteratively aligned with a learned UV positional embedding map by the recurrent update block.
  • Figure 2: An illustration of the 3D model fitting process.
  • Figure 3: An illustration of the EPE computation for each frame.
  • Figure 4: $\textit{SSME}_h$ plotted over all frame horizons for each evaluated tracker, for single-image (left) and full-sequence tracking (right). Lower $\textit{SSME}_h$ at smaller frame horizons $h$ (left in the graph) means better short-term temporal stability, while lower $\textit{SSME}_h$ at larger frame horizons (right in the graph) means better long-term tracking consistency. Our tracker performs significantly better over every time horizon.
  • Figure 5: Qualitative results on two sequences (top and bottom 3 rows) of our Multiface benchmark. Warmer colors represent high error, while colder colors represent low error. DECA, HRN, and MPT struggle with motion in the cheek and forehead region, which is visible in the SSME error plot (right columns). Despite using only 2D alignment as supervision, our method achieves a better 3D reconstruction (CD) (center columns).
  • ...and 20 more figures