Let's Get the FACS Straight -- Reconstructing Obstructed Facial Features

Tim Büchner, Sven Sickert, Gerd Fabian Volk, Christoph Anders, Orlando Guntinas-Lichius, Joachim Denzler

TL;DR

The paper tackles obstructed facial analysis by removing sEMG sensor obstructions from video frames using unpaired CycleGAN-style translation, avoiding repeated fine-tuning for each task. By treating sensor presence as a style shift, the authors reconstruct clean facial features (via $G_{S \mapsto N}$) while preserving identity and expression, enabling downstream AU and emotion analyses. Quantitative perceptual metrics (LPIPS, FID) and downstream tasks (AU with RDF/JAA-NET, emotion detection with ResMaskNet) show restoration quality approaching, and sometimes exceeding, the baseline unobstructed videos. This approach facilitates applying existing facial analysis methods to obstructed data, with subject-specific models offering robustness across individuals and recording conditions.

Abstract

The human face is one of the most crucial parts in interhuman communication. Even when parts of the face are hidden or obstructed, the underlying facial movements can be understood. Machine learning approaches often fail in that regard due to the complexity of the facial structures. To alleviate this problem, a common approach is to fine-tune a model for such a specific application. However, this is computationally intensive and might have to be repeated for each desired analysis task. In this paper, we propose to reconstruct obstructed facial parts to avoid the task of repeated fine-tuning. As a result, existing facial analysis methods can be used without further changes with respect to the data. In our approach, the restoration of facial features is interpreted as a style transfer task between different recording setups. By using the CycleGAN architecture, the requirement of matched pairs, which is often hard to fulfill, can be eliminated. To prove the viability of our approach, we compare our reconstructions with real unobstructed recordings. We created a novel data set in which 36 test subjects were recorded both with and without 62 surface electromyography sensors attached to their faces. In our evaluation, we feature typical facial analysis tasks, like the computation of Facial Action Units and the detection of emotions. To further assess the quality of the restoration, we also compare perceptual distances. We can show that scores similar to the videos without obstructing sensors can be achieved.
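
For orientation, the objective of the original CycleGAN formulation can be written in this paper's generator notation as below. This is a sketch of the standard formulation, not necessarily the exact loss used by the authors: the discriminators $D_N$ and $D_S$, the weight $\lambda$, and the sample symbols $s$ (frame with sensors) and $n$ (frame without sensors) are assumptions introduced here, and any additional loss terms or weights the paper may use are not covered.

$$
\mathcal{L}_{\text{cyc}} = \mathbb{E}_{s}\left[ \lVert G_{N \mapsto S}(G_{S \mapsto N}(s)) - s \rVert_1 \right] + \mathbb{E}_{n}\left[ \lVert G_{S \mapsto N}(G_{N \mapsto S}(n)) - n \rVert_1 \right]
$$

$$
\mathcal{L} = \mathcal{L}_{\text{GAN}}(G_{S \mapsto N}, D_N) + \mathcal{L}_{\text{GAN}}(G_{N \mapsto S}, D_S) + \lambda \, \mathcal{L}_{\text{cyc}}
$$

Each expectation runs over one domain only, which is why frames with and without sensors never have to be matched pairwise.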

Paper Structure (13 sections, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Our experimental setup to evaluate the correct restoration of facial features: A video without sEMG sensors represents our baseline (normal). For comparison we have videos with those sensors visible (sensor) and videos where they have been removed by our proposed approach (clean). Our evaluation includes the tasks of extracting Facial Action Units and emotion detection. Furthermore, we analyze their perceptual similarity in comparison to the baseline. Green check marks, red crosses and yellow question marks indicate similarity or the possibility to solve a task given the underlying data.
  • Figure 2: Overview of three selected test subjects with their three measurements on each of the two recording dates. For each subject, one recording without attached sensors and two with attached sensors are displayed. The 62 sEMG sensors are attached to the same anatomical locations for all test subjects. The sensors block relevant facial areas, such as the forehead, completely.
  • Figure 3: Double generative structure of the CycleGAN for the proposed sEMG sensor removal. Generator $G_{N \mapsto S}$ learns the attaching of the sensors. Generator $G_{S \mapsto N}$ learns the detaching of the sensors. Different facial expressions can be combined without being changed during the translation process.
  • Figure 4: We display the training progress of the sEMG sensor removal. During the first $5$ epochs the model focuses on the general removal of the sensors. After that, the more fine-grained details in the faces are restored.
  • Figure 5: Overview of two test subjects with their respective sEMG sensor removal. The covered facial features were restored in all examples.
  • ...and 4 more figures
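
The double generative structure described for Figure 3 can be illustrated with a toy cycle-consistency check. The following is a minimal sketch, not the authors' implementation: the function name `cycle_consistency_loss`, the stand-in generators `attach` and `detach`, and the idea of modelling sensors as a constant brightness offset are all hypothetical and chosen only so the example runs without a trained network. In a real setup both generators would be neural networks.

```python
import numpy as np


def cycle_consistency_loss(g_s2n, g_n2s, sensor_batch, normal_batch):
    """L1 cycle loss over two unpaired batches (hypothetical helper).

    g_s2n plays the role of G_{S->N} (sensor removal),
    g_n2s plays the role of G_{N->S} (sensor attachment).
    """
    # sensor -> normal -> sensor should reproduce the sensor frames
    sensor_rec = g_n2s(g_s2n(sensor_batch))
    # normal -> sensor -> normal should reproduce the normal frames
    normal_rec = g_s2n(g_n2s(normal_batch))
    loss = np.mean(np.abs(sensor_rec - sensor_batch))
    loss += np.mean(np.abs(normal_rec - normal_batch))
    return float(loss)


# Toy stand-ins (assumption): "sensors" are a constant brightness offset.
OFFSET = 0.25


def attach(frames):
    return frames + OFFSET


def detach(frames):
    return frames - OFFSET


rng = np.random.default_rng(0)
# The batches differ in size on purpose: the two domains are unpaired.
sensor_batch = rng.random((4, 8, 8, 3))
normal_batch = rng.random((2, 8, 8, 3))

# Generators that invert each other give a (numerically) zero cycle loss.
perfect = cycle_consistency_loss(detach, attach, sensor_batch, normal_batch)
# An identity "attach" generator cannot undo the removal, so the loss grows.
imperfect = cycle_consistency_loss(detach, lambda x: x, sensor_batch, normal_batch)
print(perfect, imperfect)
```

Because each reconstruction is compared only with its own input, the loss never needs a sensor frame and a sensor-free frame of the same moment, which is the property that makes the approach usable on separately recorded videos.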