Generalizable Facial Expression Recognition

Yuhang Zhang, Xiuqi Zheng, Chenyi Liang, Jiani Hu, Weihong Deng

TL;DR

This work tackles zero-shot generalization in facial expression recognition (FER) under domain shift, where no target-domain data, labeled or unlabeled, are available for fine-tuning. It proposes CAFE, a pipeline built on fixed face features from CLIP, which learns sigmoid masks that select expression-related features, preserving CLIP's generalization ability while keeping the precision of a FER model. The masked feature channels are separated by expression class to produce logits directly, which avoids an FC layer and reduces overfitting, and a channel-diverse loss keeps the learned masks distinct. Experiments on five FER datasets show that CAFE, trained on a single train set, outperforms state-of-the-art FER methods on unseen test sets by large margins.

Abstract

SOTA facial expression recognition (FER) methods fail on test sets that have domain gaps with the train set. Recent domain adaptation FER methods need to acquire labeled or unlabeled samples of target domains to fine-tune the FER model, which might be infeasible in real-world deployment. In this paper, we aim to improve the zero-shot generalization ability of FER methods on different unseen test sets using only one train set. Inspired by how humans first detect faces and then select expression features, we propose a novel FER pipeline to extract expression-related features from any given face images. Our method is based on the generalizable face features extracted by large models like CLIP. However, it is non-trivial to adapt the general features of CLIP for specific tasks like FER. To preserve the generalization ability of CLIP and the high precision of the FER model, we design a novel approach that learns sigmoid masks based on the fixed CLIP face features to extract expression features. To further improve the generalization ability on unseen test sets, we separate the channels of the learned masked features according to the expression classes to directly generate logits and avoid using the FC layer to reduce overfitting. We also introduce a channel-diverse loss to make the learned masks separated. Extensive experiments on five different FER datasets verify that our method outperforms SOTA FER methods by large margins. Code is available at https://github.com/zyh-uaiaaaa/Generalizable-FER.
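
The masking and channel-separation steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the contiguous grouping of channels, and the use of a sum to pool each group are assumptions made for illustration, and the channel-diverse loss is left out because the abstract does not give its form.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def masked_class_logits(face_features, mask_scores, num_classes):
    """Gate fixed face features with sigmoid masks, then pool one channel group per class.

    face_features: fixed features from a frozen backbone (e.g. CLIP), length D.
    mask_scores:   unbounded scores predicted by the trainable FER model, length D.
    num_classes:   number of expression classes; D must be divisible by it.
    """
    if len(face_features) != len(mask_scores):
        raise ValueError("features and mask scores must have the same length")
    if len(face_features) % num_classes != 0:
        raise ValueError("channel count must be divisible by num_classes")

    # The sigmoid keeps every mask value in (0, 1), so the mask can only
    # select from the fixed features rather than rescale them arbitrarily.
    masked = [f * sigmoid(s) for f, s in zip(face_features, mask_scores)]

    # Channel separation: one contiguous group of channels per class, pooled
    # into that class's logit without an FC layer.
    # ASSUMPTION: contiguous groups and sum pooling are choices of this sketch.
    group = len(masked) // num_classes
    return [sum(masked[c * group:(c + 1) * group]) for c in range(num_classes)]


# Toy example: 6 channels split across 3 hypothetical expression classes.
features = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = [0.0, 0.0, 10.0, 10.0, -10.0, -10.0]
logits = masked_class_logits(features, scores, num_classes=3)
predicted = max(range(len(logits)), key=logits.__getitem__)
```

In this toy input the mask leaves the second group almost fully open and nearly closes the third, so the second class receives the largest logit. Because the class scores come straight from the masked channels, the only trainable part is whatever produces `mask_scores`.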

Paper Structure (20 sections, 9 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: Illustration of Generalizable Facial Expression Recognition (GFER). To evaluate the generalization ability of FER methods, we train FER models on one train set and test the trained models on different unseen test sets. The difference between our task and domain adaptation FER is that we use only one train set and do not acquire any labeled or unlabeled samples from the target domain. SOTA FER methods like EAC do not work well on unseen test sets, indicating low generalization ability. Our method outperforms EAC by large margins on different unseen FER test sets, showing better generalization ability.
  • Figure 2: The framework of our proposed method, CAFE. We use a fixed pre-trained large model, such as CLIP, to extract fixed face features from the input training images; the FER model is trained to learn a mask over these fixed face features so that only the expression-related features are extracted. Note that this is similar to how humans perceive expressions: we first observe faces and then extract expression-relevant features. The learned mask is regularized by a sigmoid function to prevent overfitting. We further introduce channel separation and a channel-diverse loss to make the learned masks diverse and improve the generalization ability.
  • Figure 3: The hyperparameter study of our method. Our method is not very sensitive to the two hyperparameters, so they can be chosen from a wide range. For simplicity, we use $\lambda = 1.5$ and $\beta = 5$ across all experiments. The cases where $\lambda = 0$ or $\beta = 0$ are covered in the ablation study.
  • Figure 4: Features of FERPlus test samples extracted by the RAF-DB-trained models of EAC and our method. Labels are displayed in black, correct predictions in blue, and incorrect predictions in red.
  • Figure 5: Visualization of the sigmoid masks.