DrFER: Learning Disentangled Representations for 3D Facial Expression Recognition

Hebeizi Li, Hongyu Yang, Di Huang

TL;DR

DrFER tackles identity-expression entanglement in 3D FER by learning disentangled representations from facial point clouds with a dual-branch architecture (expression vs identity), cross-over reconstructions, and a fusion module. The model is trained in three stages with auxiliary classification, reconstruction, triplet, and cross-over losses, avoiding KL/JS regularization due to point-cloud distributions. Empirical results on BU-3DFE and Bosphorus show state-of-the-art performance among 3D-only FER methods and robustness to head pose variations, approaching 2D+3D performance with 3D data alone. The work demonstrates that disentangled latent spaces can improve expression recognition in 3D and offers a scalable framework for robust FER in real-world, pose-variant scenarios.
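As a rough illustration of two of the loss families named above, the sketch below implements a triplet loss on feature vectors and a point-cloud reconstruction loss. The summary does not say which reconstruction metric DrFER uses; the symmetric Chamfer distance is a common choice for point clouds and is an assumption here, as are the function names and the margin value.

```python
import math


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two point sets.

    Assumption: used here as a stand-in reconstruction loss; the
    summary does not state which metric the paper adopts.
    Each cloud is a list of (x, y, z) tuples.
    """
    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest
        # neighbour in the destination cloud.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on feature vectors (Euclidean distance).

    The margin of 1.0 is a hypothetical default, not a value from the paper.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

The triplet loss is zero once the negative sample lies at least `margin` farther from the anchor than the positive sample, which is the property that pulls same-label features together and pushes different-label features apart.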

Abstract

Facial Expression Recognition (FER) has consistently been a focal point in the field of facial analysis. In the context of existing methodologies for 3D FER or 2D+3D FER, the extraction of expression features often gets entangled with identity information, compromising the distinctiveness of these features. To tackle this challenge, we introduce the innovative DrFER method, which brings the concept of disentangled representation learning to the field of 3D FER. DrFER employs a dual-branch framework to effectively disentangle expression information from identity information. Diverging from prior disentanglement endeavors in the 3D facial domain, we have carefully reconfigured both the loss functions and network structure to make the overall framework adaptable to point cloud data. This adaptation enhances the capability of the framework in recognizing facial expressions, even in cases involving varying head poses. Extensive evaluations conducted on the BU-3DFE and Bosphorus datasets substantiate that DrFER surpasses the performance of other 3D FER methods.

Paper Structure (21 sections, 7 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Illustrations of 3D facial shape space. Human expressions occupy manifolds within a high-dimensional space, which exhibit similar patterns across different individuals and the center of each expression manifold corresponds to the neutral expression.
  • Figure 2: Method overview. The proposed DrFER model comprises two key components: the disentangling component and the fusion component. The former employs a dual-branch architecture to explicitly learn expression and identity features, and generate the corresponding de-identity and de-expression faces, respectively. The model subsequently recombines these disentangled faces in a cross-over manner and reconstructs the original face with the fusion component, facilitating the disentanglement process. To guide the training process effectively, a series of training losses are employed, including those specifically tailored for point cloud data. The training stages corresponding to each module are labeled with lowercase Roman numerals in the figure.
  • Figure 3: Detailed architectures of the proposed expression/identity branch, the cross-modal fusion module, and the classifier.
  • Figure 4: Visualization of the rotated faces. The faces in the top two rows and the bottom two rows are from two different randomly selected subjects. Columns indicate different rotation angles. Faces in the same column have the same rotation angle. The numbers below each facial scan represent the points that are preserved after point cloud rotation.
  • Figure 5: Comparative results of the rotation experiment between the disentangled and the baseline methods. The results of these two methods are plotted as solid and dashed lines, where blue and red colors represent the pitch and yaw rotations, respectively.
  • ...and 1 more figure
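The pitch and yaw rotations referred to in the captions for Figures 4 and 5 can be sketched as plain rotations of a point cloud. This is a minimal illustration under stated assumptions: pitch is taken as rotation about the x-axis and yaw as rotation about the y-axis, applied in that order, and the function name is hypothetical. It does not model the removal of points that become hidden after rotation, which is what the preserved-point counts in Figure 4 reflect.

```python
import math


def rotate_points(points, pitch_deg=0.0, yaw_deg=0.0):
    """Rotate (x, y, z) points by pitch (about x), then yaw (about y).

    Assumption: the axis convention and rotation order are illustrative
    and may differ from the paper's experimental setup.
    """
    pitch = math.radians(pitch_deg)
    yaw = math.radians(yaw_deg)
    cos_p, sin_p = math.cos(pitch), math.sin(pitch)
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)

    rotated = []
    for x, y, z in points:
        # Pitch: rotate in the y-z plane about the x-axis.
        y1 = cos_p * y - sin_p * z
        z1 = sin_p * y + cos_p * z
        # Yaw: rotate in the x-z plane about the y-axis.
        x2 = cos_y * x + sin_y * z1
        z2 = -sin_y * x + cos_y * z1
        rotated.append((x2, y1, z2))
    return rotated
```

Because the transform is a pure rotation, distances between points are unchanged, so any drop in recognition accuracy under rotation comes from the changed orientation and the missing points rather than from shape distortion.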