Emotion Separation and Recognition from a Facial Expression by Generating the Poker Face with Vision Transformers

Jia Li; Jiantao Nie; Dan Guo; Richang Hong; Meng Wang

TL;DR

The paper presents PF-ViT, a framework built on vanilla Vision Transformers for in-the-wild facial expression recognition (FER) that disentangles emotion from non-emotional factors by generating a poker-face counterpart of the input. Its components are first pre-trained as Masked Autoencoders (MAE) on unlabeled facial data to learn robust representations. The model is then trained within PF-MAG, a plain ViT-based GAN, to split latent features into an emotion-relevant part $\boldsymbol{v_{e}}$ and an emotion-irrelevant part $\boldsymbol{v_{p}}$, using a cross-fusion generator and adversarial feedback to enforce disentanglement while maintaining reconstruction fidelity. PF-ViT reports state-of-the-art accuracy on RAF-DB, AffectNet-7/8, and FERPlus, illustrating the effectiveness of explicit emotion disentanglement and the value of unlabeled pretraining for Transformer-based FER. The approach is practical and data-efficient, and inference stays cheap because the poker-face generator can be dropped at test time while the model retains strong generalization on challenging real-world data.

Abstract

Representation learning and feature disentanglement have garnered significant research interest in the field of facial expression recognition (FER). The inherent ambiguity of emotion labels poses challenges for conventional supervised representation learning methods. Moreover, directly learning the mapping from a facial expression image to an emotion label lacks explicit supervision signals for capturing fine-grained facial features. In this paper, we propose a novel FER model, named Poker Face Vision Transformer or PF-ViT, to address these challenges. PF-ViT aims to separate and recognize the disturbance-agnostic emotion from a static facial image by generating its corresponding poker face, without the need for paired images. Inspired by the Facial Action Coding System, we regard an expressive face as the combined result of a set of facial muscle movements on one's poker face (i.e., an emotionless face). PF-ViT utilizes vanilla Vision Transformers, and its components are first pre-trained as Masked Autoencoders on a large facial expression dataset without emotion labels, yielding excellent representations. Subsequently, we train PF-ViT using a GAN framework. During training, the auxiliary task of poker face generation promotes the disentanglement between emotional and emotion-irrelevant components, guiding the FER model to holistically capture discriminative facial details. Quantitative and qualitative results demonstrate the effectiveness of our method, which surpasses state-of-the-art methods on four popular FER datasets.
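
The Masked Autoencoder pre-training mentioned in the abstract hides a random subset of image patches and asks the network to reconstruct them. The following is a minimal sketch of that masking step only, not the authors' implementation. The function name `random_masking`, the 75% mask ratio, the 16x16 patch size (giving 196 tokens for a 224x224 input), and the 768-dimensional tokens are all assumptions chosen for illustration; none of them is stated in the text above.

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, seed=0):
    """Keep a random subset of patch tokens, in the style of MAE pre-training.

    tokens: array of shape (num_patches, dim).
    Returns (visible_tokens, keep_idx, mask), where mask[i] is True
    if patch i was hidden from the encoder.
    """
    rng = np.random.default_rng(seed)
    num_patches = tokens.shape[0]
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    # Shuffle patch indices and keep the first num_keep of them.
    perm = rng.permutation(num_patches)
    keep_idx = np.sort(perm[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask

# Assumed setup: 224x224 image, 16x16 patches -> 14 * 14 = 196 tokens.
tokens = np.random.default_rng(1).normal(size=(196, 768))
visible, keep_idx, mask = random_masking(tokens)
print(visible.shape, int(mask.sum()))
```

Only the visible tokens would be passed to the encoder; a decoder would then receive them together with learnable mask tokens at the hidden positions and be trained to reconstruct the missing patches.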

Paper Structure (16 sections, 11 equations, 12 figures, 10 tables)

Introduction
Related Work
Feature Disentanglement for FER
Representation Learning for FER
Our Method
Overview
Preliminary: ViT Pre-Training and Our Baselines
PF-ViT: Poker Face Vision Transformer
Experiments
Datasets
Implementation Details
Ablation Studies
Comparisons with the State of the Art
Conclusion
Failure examples produced by our PF-ViT
...and 1 more sections

Figures (12)

Figure 1: Illustration of emotion separation from expressive faces in a latent space. We assume that an expressive face is the combined outcome of a set of facial muscle movements on one’s poker face. Our method focuses on separating the emotional component and preserving all emotion-irrelevant details (i.e., disturbance) to synthesize the emotionless counterpart.
Figure 2: Accuracy vs. GFLOPs during inference. Comparison of our FER models with SOTA FER models on the RAF-DB testing set [li2017reliable] in terms of accuracy, computational complexity (GFLOPs) during testing, and model size (#params). Our models use ViT-Base, ViT-Small and ViT-Tiny [touvron2021training] as the image encoders, with an input size of $224 \times 224$. In this analysis, our proposed PF-ViT model with ViT-Base as the image encoder is denoted as PF-ViT-B, and its image generator and cross-fusion module are excluded from the test-time measurements.
Figure 3: Overview of PF-ViT, which is trained in the framework of PF-MAG, a plain ViT-based GAN. Here, we reuse the pre-trained ViT encoder from our MAE pre-training stage as $E$, and also reuse the mask tokens $\boldsymbol{m}$. Similarly, the initial $G$ and $D$ are copies of the ViT decoder used in MAE pre-training. The identity-independent emotion representation $\boldsymbol{v_{e}}$ and emotionless representation $\boldsymbol{v_{p}}$ are orthogonal to each other. We encourage PF-ViT to reconstruct the original face by feeding $\left( \boldsymbol{v_{e}} + \boldsymbol{v_{p}} \right)$ to its image generator $G$, and to generate a realistic poker face preserving all the emotion-irrelevant details of the input face when only $\boldsymbol{v_{p}}$ is used. PF-ViT and the discriminator $D$ are trained adversarially. During testing, PF-ViT classifies the facial expression based on $\boldsymbol{v_{e}}$.
Figure 4: Visualization of the emotion separation and poker face generation. The PokerFace and Emotion images are restored by the generator $G$ of PF-ViT from the disentangled components $\boldsymbol{v_{p}}$ and $\boldsymbol{v_{e}}$, respectively. Left: both visual tokens and mask tokens are fed into the generator $G$. Comparing the original expressive face with the synthesized poker face reveals the discrepancy in emotional details. Note that we train PF-ViT using only unpaired images. Right: apart from the visual tokens produced by the token separator $S$, mask tokens are also crucial for the generator $G$ to synthesize satisfactory poker faces.
Figure 5: Confusion matrices of our PF-ViT on the RAF-DB, AffectNet-7, AffectNet-8 and FERPlus testing sets. The average recognition rates across expression classes are 92.07%, 67.23%, 64.10% and 91.16% respectively.
...and 7 more figures
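
The Figure 3 caption describes two properties of the disentangled representations: $\boldsymbol{v_{e}}$ and $\boldsymbol{v_{p}}$ are orthogonal, and their sum is what the generator uses to reconstruct the original face. The sketch below illustrates only those two properties on a toy vector. It is not the paper's token separator $S$, which is learned. A fixed linear projection onto a hypothetical "emotion" subspace is swapped in as a stand-in, and the name `split_orthogonal`, the random `basis`, and the dimensions are assumptions made for illustration.

```python
import numpy as np

def split_orthogonal(v, basis):
    """Split v into a part inside span(basis) and the orthogonal remainder.

    v: latent vector of shape (dim,).
    basis: (dim, k) matrix whose columns span a hypothetical emotion subspace.
    Returns (v_e, v_p) with v_e + v_p == v and v_e orthogonal to v_p.
    """
    # Orthonormalize the basis so that q @ q.T is an orthogonal projector.
    q, _ = np.linalg.qr(basis)
    v_e = q @ (q.T @ v)   # component inside the assumed emotion subspace
    v_p = v - v_e         # remainder, standing in for the poker-face part
    return v_e, v_p

rng = np.random.default_rng(0)
v = rng.normal(size=64)
basis = rng.normal(size=(64, 8))
v_e, v_p = split_orthogonal(v, basis)

print("inner product:", float(v_e @ v_p))
print("sum restores v:", bool(np.allclose(v_e + v_p, v)))
```

In this toy setting, feeding `v_e + v_p` to a generator corresponds to reconstructing the expressive face, while feeding `v_p` alone corresponds to synthesizing the poker face. How PF-ViT actually enforces orthogonality during training is not specified in the captions above.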
