Knowledge-Enhanced Facial Expression Recognition with Emotional-to-Neutral Transformation

Hangyu Li, Yihan Xu, Jiangchao Yao, Nannan Wang, Xinbo Gao, Bo Han

TL;DR

This work tackles facial expression recognition (FER) by moving beyond discrete-label supervision and using text embeddings from a frozen vision-language model (VLM) to guide representation learning. It introduces a knowledge-enhanced FER framework with two components: an emotional-to-neutral transformation, which derives a neutral representation from each facial expression representation, and a self-contrast objective, which pulls the facial expression representation toward its textual expression embedding while pushing it away from the neutral representation. The method uses text prompts to generate category embeddings, a transformation network to obtain neutral representations, and a combined loss $\mathcal{L}_{total} = \lambda_{s}\mathcal{L}_{s} + \lambda_{t}\mathcal{L}_{t} + \lambda_{c}\mathcal{L}_{c}$. It reports consistent improvements on four challenging FER datasets (RAF-DB, AffectNet, FERPlus, CK+) with both ResNet-18 and Swin-T backbones. The results indicate that text-driven supervision not only boosts accuracy but also improves cross-dataset generalization, suggesting practical benefits for robust FER in real-world settings.

Abstract

Existing facial expression recognition (FER) methods typically fine-tune a pre-trained visual encoder using discrete labels. However, this form of supervision is limited in its ability to specify the emotional concept behind different facial expressions. In this paper, we observe that the rich knowledge in text embeddings, generated by vision-language models, is a promising alternative for learning discriminative facial expression representations. Inspired by this, we propose a novel knowledge-enhanced FER method with an emotional-to-neutral transformation. Specifically, we formulate the FER problem as matching a facial expression representation against text embeddings by similarity. Then, we transform the facial expression representation to a neutral representation by simulating the difference in text embeddings from textual facial expression to textual neutral. Finally, a self-contrast objective is introduced to pull the facial expression representation closer to the textual facial expression, while pushing it farther from the neutral representation. We evaluate the method with diverse pre-trained visual encoders, including ResNet-18 and Swin-T, on four challenging facial expression datasets. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art FER methods. The code will be publicly available.
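
The similarity-matching formulation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors stand in for the outputs of a visual encoder and a frozen VLM text encoder, and the use of cosine similarity, the `temperature` value of 0.07, and the function names are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_cross_entropy(image_repr, text_embeddings, label, temperature=0.07):
    """Cross-entropy over temperature-scaled similarities.

    Assumed form of the similarity loss L_s; the temperature is a
    hypothetical CLIP-style default, not a value taken from the paper.
    """
    logits = [cosine_similarity(image_repr, t) / temperature for t in text_embeddings]
    probs = softmax(logits)
    return -math.log(probs[label]), probs


# Toy stand-ins for frozen text embeddings of three category prompts.
text_embeddings = [
    [1.0, 0.0, 0.0],  # e.g. "happy"
    [0.0, 1.0, 0.0],  # e.g. "sad"
    [0.0, 0.0, 1.0],  # e.g. "neutral"
]
# Toy stand-in for a facial expression representation from the visual encoder.
image_repr = [0.9, 0.1, 0.2]

loss, probs = similarity_cross_entropy(image_repr, text_embeddings, label=0)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

In this setup the text embeddings play the role that classifier weights play under discrete-label supervision: the predicted category is the one whose text embedding is most similar to the image representation.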

Paper Structure

This paper contains 16 sections, 11 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Illustration of facial expression recognition. (a) When ResNet-18 is fine-tuned with discrete labels, a classifier is trained to map facial expression representations to confidence scores. (b) When ResNet-18 is fine-tuned with VLM text embeddings, facial expression representations are compared against the text embeddings to obtain similarity scores. After fine-tuning, we use t-SNE (van der Maaten and Hinton, 2008) to visualize the representation distribution of the RAF-DB testing data.
  • Figure 2: Illustration of the proposed method, whose core is to match facial expression representations from the visual encoder $\mathcal{F}_{e}$ with the corresponding text embeddings from the frozen VLM text encoder $\mathcal{F}_{t}$. First, we compute the similarity score between the facial expression representation $\mathbf{v}_{i}$ and the text embedding $\mathbf{t}_{i}$ and supervise it with a cross-entropy loss $\mathcal{L}_{s}$. Then, we transform the facial expression representation $\mathbf{v}_{i}$ to a neutral representation $\mathbf{n}_{i}$ via a network $\mathcal{F}_{n}$. To achieve this, we measure the similarity between the representation difference $\Delta\mathbf{v}$ and the embedding difference $\Delta\mathbf{t}$ via a transformation loss $\mathcal{L}_{t}$. Finally, given an anchor $\mathbf{t}_{i}$, a positive $\mathbf{v}_{i}$, and a negative $\mathbf{n}_{i}$, a self-contrast objective $\mathcal{L}_{c}$ constrains the distance between the text-expression representation pair $(\mathbf{t}_{i},\mathbf{v}_{i})$ and the text-neutral representation pair $(\mathbf{t}_{i},\mathbf{n}_{i})$. For clarity, we present three images from RAF-DB annotated with different categories.
  • Figure 3: 2D t-SNE visualization (van der Maaten and Hinton, 2008) of facial expression representations extracted from the RAF-DB testing set using ResNet-18 fine-tuned with (a) $\mathcal{L}_{s}$ and (b) the combination of $\mathcal{L}_{s}$, $\mathcal{L}_{t}$, and $\mathcal{L}_{c}$.
  • Figure 4: Evaluation of different choices for $\mathcal{L}_{c}$, namely a contrastive learning (CL) objective and the self-contrast (SC) objective, using the pre-trained Swin-T, ResNet-18, and ViT-B/16 from CLIP on (a) RAF-DB and (b) AffectNet (7 cls).
  • Figure 5: Evaluation of different settings of the balancing hyper-parameters ($\lambda_{t}$:$\lambda_{c}$) using the pre-trained Swin-T on RAF-DB, AffectNet (8 cls), and FERPlus. The performance with the default setting is marked in red.
  • ...and 2 more figures
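
The pipeline described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the paper's formulation. The weighted sum of three losses follows the combined loss stated in the TL;DR. Everything else is an assumption and is labeled as such in the code: the cosine-based form of the transformation loss, the margin-based form of the self-contrast objective, the `margin` value, the default weights, and all function names. The neutral representation `n` is taken as given here; in the method it would come from the transformation network $\mathcal{F}_{n}$.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def subtract(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def transformation_loss(v, n, t_expr, t_neutral):
    """Assumed form of L_t: align the representation difference with the
    text-embedding difference, both taken from expression to neutral."""
    delta_v = subtract(n, v)
    delta_t = subtract(t_neutral, t_expr)
    return 1.0 - cosine(delta_v, delta_t)


def self_contrast_loss(t_expr, v, n, margin=0.2):
    """Assumed margin-based form of L_c: anchor t_expr, positive v, negative n.
    The margin value is hypothetical."""
    dist_pos = 1.0 - cosine(t_expr, v)
    dist_neg = 1.0 - cosine(t_expr, n)
    return max(0.0, dist_pos - dist_neg + margin)


def total_loss(l_s, l_t, l_c, lambda_s=1.0, lambda_t=1.0, lambda_c=1.0):
    """Weighted sum of the three losses; default weights are placeholders."""
    return lambda_s * l_s + lambda_t * l_t + lambda_c * l_c


# Toy 2D stand-ins for text embeddings and image-side representations.
t_expr = [1.0, 0.0]      # text embedding of the expression category
t_neutral = [0.0, 1.0]   # text embedding of the neutral category
v = [0.9, 0.1]           # facial expression representation
n = [0.1, 0.9]           # neutral representation (would come from F_n)

l_t = transformation_loss(v, n, t_expr, t_neutral)
l_c = self_contrast_loss(t_expr, v, n)
combined = total_loss(l_s=0.5, l_t=l_t, l_c=l_c)  # l_s=0.5 is a placeholder value
```

With these toy vectors the representation difference is parallel to the text-embedding difference, and the expression representation is already much closer to the anchor than the neutral one, so both auxiliary terms vanish and only the placeholder similarity loss contributes.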