Emotion Separation and Recognition from a Facial Expression by Generating the Poker Face with Vision Transformers
Jia Li, Jiantao Nie, Dan Guo, Richang Hong, Meng Wang
TL;DR
The paper presents PF-ViT, a framework built on vanilla Vision Transformers for in-the-wild facial expression recognition (FER) that disentangles emotion from non-emotional factors by generating a poker-face counterpart of the input. It first uses MAE-based self-supervised pretraining on unlabeled facial data to learn robust representations, then trains a PF-MAG GAN that splits latent features into an emotion-relevant part v_e and an emotion-irrelevant part v_p, using a cross-fusion generator and adversarial feedback to enforce disentanglement while maintaining reconstruction fidelity. PF-ViT achieves state-of-the-art accuracy on RAF-DB, AffectNet-7/8, and FERPlus, illustrating the effectiveness of explicit emotion disentanglement and the value of unlabeled pretraining for Transformer-based FER. Because the poker-face generator is only an auxiliary training component, it can be dropped at test time, so inference stays efficient while the model retains strong generalization on challenging real-world data.
Abstract
Representation learning and feature disentanglement have garnered significant research interest in the field of facial expression recognition (FER). The inherent ambiguity of emotion labels poses challenges for conventional supervised representation learning methods. Moreover, directly learning the mapping from a facial expression image to an emotion label lacks explicit supervision signals for capturing fine-grained facial features. In this paper, we propose a novel FER model, named Poker Face Vision Transformer or PF-ViT, to address these challenges. PF-ViT aims to separate and recognize the disturbance-agnostic emotion from a static facial image by generating its corresponding poker face, without the need for paired images. Inspired by the Facial Action Coding System, we regard an expressive face as the combined result of a set of facial muscle movements on one's poker face (i.e., an emotionless face). PF-ViT utilizes vanilla Vision Transformers, and its components are first pre-trained as Masked Autoencoders on a large facial expression dataset without emotion labels, yielding excellent representations. Subsequently, we train PF-ViT using a GAN framework. During training, the auxiliary task of poker face generation promotes the disentanglement between emotional and emotion-irrelevant components, guiding the FER model to holistically capture discriminative facial details. Quantitative and qualitative results demonstrate the effectiveness of our method, which surpasses state-of-the-art methods on four popular FER datasets.
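
To make the Masked Autoencoder pre-training step more concrete, the following is a minimal sketch of the random patch masking that MAE-style pre-training relies on: a random subset of image patches is hidden, and only the visible ones are fed to the encoder. The function name `random_patch_mask`, the 75% mask ratio (the default in the original MAE work), and the 14 x 14 patch grid are assumptions for illustration; the abstract does not state the settings used for PF-ViT.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) lists for one image.

    Hypothetical helper for illustration only; the mask ratio and the
    sampling scheme are assumptions, not PF-ViT's documented settings.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])    # patches hidden from the encoder
    visible = sorted(order[num_masked:])   # patches the encoder actually sees
    return visible, masked


# Assumed setup: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

In this kind of pre-training, the decoder is asked to reconstruct the masked patches from the visible ones, which is why no emotion labels are required at this stage.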
