DrFER: Learning Disentangled Representations for 3D Facial Expression Recognition
Hebeizi Li, Hongyu Yang, Di Huang
TL;DR
DrFER tackles identity-expression entanglement in 3D FER by learning disentangled representations from facial point clouds. It uses a dual-branch architecture (one branch for expression, one for identity), cross-over reconstructions, and a fusion module. The model is trained in three stages with auxiliary classification, reconstruction, triplet, and cross-over losses; it avoids KL/JS divergence regularization because such terms are a poor fit for point-cloud distributions. On BU-3DFE and Bosphorus, DrFER reports state-of-the-art performance among 3D-only FER methods and robustness to head pose variations, approaching 2D+3D performance with 3D data alone. The results suggest that disentangled latent spaces can improve 3D expression recognition, including in pose-variant scenarios.
Abstract
Facial Expression Recognition (FER) has consistently been a focal point in the field of facial analysis. In the context of existing methodologies for 3D FER or 2D+3D FER, the extraction of expression features often gets entangled with identity information, compromising the distinctiveness of these features. To tackle this challenge, we introduce the innovative DrFER method, which brings the concept of disentangled representation learning to the field of 3D FER. DrFER employs a dual-branch framework to effectively disentangle expression information from identity information. Diverging from prior disentanglement endeavors in the 3D facial domain, we have carefully reconfigured both the loss functions and network structure to make the overall framework adaptable to point cloud data. This adaptation enhances the capability of the framework in recognizing facial expressions, even in cases involving varying head poses. Extensive evaluations conducted on the BU-3DFE and Bosphorus datasets substantiate that DrFER surpasses the performance of other 3D FER methods.
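To make two of the training signals named in the TL;DR concrete, here is a minimal illustrative sketch in plain Python. It is not the authors' implementation. It assumes the standard margin-based triplet loss on expression embeddings, and it assumes that a cross-over amounts to swapping the expression codes of two faces before decoding; this summary does not confirm either detail. The function names `triplet_loss` and `crossover_pairs` are hypothetical, and the encoders and decoder are omitted.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    Assumption: anchor and positive share an expression label,
    while negative carries a different expression.
    """
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)


def crossover_pairs(codes_a, codes_b):
    """Swap expression codes between two faces (assumed cross-over scheme).

    Each argument is an (expression_code, identity_code) tuple. The
    returned pairs are what a decoder would receive in order to
    reconstruct face A with B's expression and face B with A's expression.
    """
    exp_a, id_a = codes_a
    exp_b, id_b = codes_b
    return (exp_b, id_a), (exp_a, id_b)
```

For example, `triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])` is zero because the negative is already more than one margin farther from the anchor than the positive. Swapping the positive and negative yields a positive loss, which would push same-expression embeddings together during training.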
