eMotion-GAN: A Motion-based GAN for Photorealistic and Facial Expression Preserving Frontal View Synthesis

Omar Ikne, Benjamin Allaert, Ioan Marius Bilasco, Hazem Wannous

TL;DR

The paper tackles the degradation of facial expression recognition (FER) performance under head pose variations by introducing eMotion-GAN, a frontal view synthesis framework that operates in the motion domain. It decomposes the task into two stages: frontalizing facial motion with the generator $G_{F}$ (guided by the discriminators $D_{P}$ and $D_{E}$), then warping this motion onto a neutral frontal face with $G_{W}$ to produce an expressive frontal face (monitored by the pre-trained classifier $e_{FER}$). Key contributions include a novel flow-based frontalization approach, a dual-discriminator setup that preserves motion properties and expressions, and an end-to-end training scheme augmented by EndPoint and Charbonnier losses. Empirical results on SNaP-2DFe and auxiliary FER datasets show a reduced FER gap between frontal and non-frontal faces, with FER improvements of up to +5% for small pose variations and up to +20% for larger ones. The paper also demonstrates cross-subject motion transfer and reports ablation studies. The work advances robust FER under real-world pose variations and offers potential uses in data augmentation and synthesis, while acknowledging misuse risks in deepfake contexts.
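The EndPoint and Charbonnier losses named above can be illustrated with a minimal, dependency-free sketch. It uses the standard textbook definitions (mean Euclidean distance between flow vectors, and the smooth penalty $\sqrt{d^2 + \epsilon^2}$ per component); the function names, the list-of-tuples flow layout, and the value of `eps` are assumptions for illustration, and the paper's exact formulation and loss weights are not given in this summary.

```python
import math

def endpoint_error(pred, target):
    """Mean Euclidean distance between predicted and reference flow vectors.

    Both arguments are equal-length lists of (u, v) displacement pairs
    (hypothetical layout chosen for this sketch).
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("flows must be non-empty and of equal length")
    total = sum(math.hypot(pu - tu, pv - tv)
                for (pu, pv), (tu, tv) in zip(pred, target))
    return total / len(pred)

def charbonnier(pred, target, eps=1e-3):
    """Mean Charbonnier penalty sqrt(d^2 + eps^2) over all flow components.

    eps is an assumed smoothing constant; it keeps the penalty
    differentiable at d = 0 while behaving like |d| for large errors.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("flows must be non-empty and of equal length")
    diffs = [p - t for pv, tv in zip(pred, target) for p, t in zip(pv, tv)]
    return sum(math.sqrt(d * d + eps * eps) for d in diffs) / len(diffs)

# Toy example: one vector is off by (3, 4), the other is exact.
predicted = [(3.0, 4.0), (0.0, 0.0)]
reference = [(0.0, 0.0), (0.0, 0.0)]
epe = endpoint_error(predicted, reference)   # (5 + 0) / 2
smooth_l1_like = charbonnier(predicted, reference)
```

Both terms measure reconstruction error between the frontalized flow and the reference flow; the Charbonnier form is less sensitive to outlier vectors than a squared error.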

Abstract

Many existing facial expression recognition (FER) systems encounter substantial performance degradation when faced with variations in head pose. Numerous frontalization methods have been proposed to enhance these systems' performance under such conditions. However, they often introduce undesirable deformations, rendering them less suitable for precise facial expression analysis. In this paper, we present eMotion-GAN, a novel deep learning approach designed for frontal view synthesis while preserving facial expressions within the motion domain. Considering the motion induced by head variation as noise and the motion induced by facial expression as the relevant information, our model is trained to filter out the noisy motion in order to retain only the motion related to facial expression. The filtered motion is then mapped onto a neutral frontal face to generate the corresponding expressive frontal face. We conducted extensive evaluations using several widely recognized dynamic FER datasets, which encompass sequences exhibiting various degrees of head pose variations in both intensity and orientation. Our results demonstrate the effectiveness of our approach in significantly reducing the FER performance gap between frontal and non-frontal faces. Specifically, we achieved a FER improvement of up to +5% for small pose variations and up to +20% improvement for larger pose variations. Code available at https://github.com/o-ikne/eMotion-GAN.git.

Paper Structure (30 sections, 5 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: eMotion-GAN estimates the motion induced by the variation of the head pose (column 5) and removes it, keeping only the motion induced by the facial muscles. The latter is then transposed to the frontal plane (column 6) to match the ground truth (column 4) and warped onto a neutral face (column 7) to assist in FER. The network training is based on a synchronous capture system (SNaP-2DFe [allaert2018impact]), where facial motion is computed simultaneously in the absence (cam1) and presence (cam2) of head pose variation.
  • Figure 2: Setup used in datasets to generate head pose variations and provide ground truth of facial movement and expression. GT: Ground Truth; P: Pose.
  • Figure 3: eMotion-GAN approach. $1^{st}$ phase: motion frontalization. Given an input optical flow ($y$), the generator $(G_{F})$ estimates and filters the motion induced by the variation of the head pose and transposes the motion induced by the facial expression in the frontal plane ($\tilde{y}^*$) to approximate the real motion ($y^*$). Besides the reconstruction losses, the expression discriminator $D_{E}$ is introduced to force $G_{F}$ to preserve the facial expression through the loss $\mathcal{L}_{e}$. The discriminator $D_{P}$ ensures that the properties of the facial movement (e.g. intensity and direction) are preserved through the $\mathcal{L}_{adv}$ loss. $2^{nd}$ phase: motion warping. Given a neutral face $(x_t^*)$ and the frontalized motion field $\tilde{y}^*$, the generator $G_{W}$ generates the corresponding expressive face. A pre-trained classifier $e_{FER}$ predicts the corresponding expression.
  • Figure 4: Cross-validation evaluation protocol. A: training the end-to-end model on training folds. B: frontalizing flows and warping images of the test fold using the trained models. C: evaluating performances using FER models on the reconstructed test fold in both motion and image domains.
  • Figure 5: Qualitative comparison of different frontal view synthesis (FVS) methods in the image domain. Our approach effectively handles large pose variations for frontal view synthesis while preserving facial expressions.
  • ...and 1 more figure
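
The second phase in the Figure 3 caption maps a frontalized motion field onto a neutral face. In the paper this is done by a learned generator $G_{W}$; the sketch below is not that network. It is a classical stand-in, backward warping with bilinear interpolation, meant only to show what "applying a flow field to an image" means. The function names, the nested-list image layout, and the flow convention (each output pixel stores the offset `(dx, dy)` to the source location it samples from) are all assumptions for illustration.

```python
import math

def bilinear_sample(img, x, y):
    """Sample a grayscale image (list of rows) at a real-valued (x, y).

    Coordinates outside the image are clamped to the border.
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom

def warp(img, flow):
    """Backward-warp img with a dense flow field.

    flow[y][x] = (dx, dy) is the assumed offset from output pixel (x, y)
    to the location in img whose intensity it receives.
    """
    h, w = len(img), len(img[0])
    return [[bilinear_sample(img, x + flow[y][x][0], y + flow[y][x][1])
             for x in range(w)]
            for y in range(h)]

# Toy example: a 2x3 "neutral face" and a uniform half-pixel horizontal flow.
neutral = [[0.0, 1.0, 2.0],
           [3.0, 4.0, 5.0]]
half_pixel_flow = [[(0.5, 0.0)] * 3 for _ in range(2)]
expressive = warp(neutral, half_pixel_flow)
```

A fixed interpolation rule like this cannot synthesize content that is absent from the neutral face (for example, teeth that become visible in a smile), which is one reason a learned warping generator can be preferable.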