Knowledge-Enhanced Facial Expression Recognition with Emotional-to-Neutral Transformation
Hangyu Li, Yihan Xu, Jiangchao Yao, Nannan Wang, Xinbo Gao, Bo Han
TL;DR
This work tackles facial expression recognition (FER) by moving beyond traditional discrete-label supervision and using text embeddings from a frozen vision-language model (VLM) to guide representation learning. It introduces a knowledge-enhanced FER framework with an emotional-to-neutral transformation that derives a neutral representation from the facial expression representation, and a self-contrast objective that pulls the facial expression representation toward its textual expression embedding while pushing it away from the neutral representation. The method uses text prompts to generate category embeddings, a transformation network to obtain neutral representations, and a combined loss L_total = λ_s L_s + λ_t L_t + λ_c L_c. It reports consistent improvements on four challenging FER datasets (RAF-DB, AffectNet, FERPlus, CK+) with both ResNet-18 and Swin-T backbones. The results indicate that text-driven supervision not only boosts accuracy but also improves cross-dataset generalization, suggesting practical benefits for robust FER in real-world settings.
Abstract
Existing facial expression recognition (FER) methods typically fine-tune a pre-trained visual encoder using discrete labels. However, this form of supervision is limited in its ability to specify the emotional concept behind different facial expressions. In this paper, we observe that the rich knowledge in text embeddings generated by vision-language models is a promising alternative for learning discriminative facial expression representations. Inspired by this, we propose a novel knowledge-enhanced FER method with an emotional-to-neutral transformation. Specifically, we formulate the FER problem as matching a facial expression representation against text embeddings by similarity. Then, we transform the facial expression representation into a neutral representation by simulating the difference in text embeddings between the textual facial expression and textual neutral. Finally, a self-contrast objective is introduced to pull the facial expression representation closer to the textual facial expression while pushing it farther from the neutral representation. We evaluate the method with diverse pre-trained visual encoders, including ResNet-18 and Swin-T, on four challenging facial expression datasets. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art FER methods. The code will be publicly available.
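
The three steps described in the abstract (similarity matching, emotional-to-neutral transformation, self-contrast) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes precomputed text embeddings (one per expression class, including neutral), replaces the paper's learned transformation network with a plain vector offset between text embeddings, and uses a two-way InfoNCE-style loss as a stand-in for the self-contrast objective. All function names, the `temperature` value, and `neutral_idx` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def similarity_logits(face_feat, text_embeds, temperature=0.07):
    """Cosine similarity between each face feature (B, D) and each class text
    embedding (C, D), scaled by an assumed temperature. Returns (B, C)."""
    return l2_normalize(face_feat) @ l2_normalize(text_embeds).T / temperature

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def to_neutral(face_feat, text_embeds, labels, neutral_idx):
    """Assumed simplification: shift each face feature by the text-space offset
    from its labelled expression to 'neutral'. The paper uses a learned
    transformation network here instead of this fixed offset."""
    offset = text_embeds[neutral_idx] - text_embeds[labels]  # (B, D)
    return face_feat + offset

def self_contrast_loss(face_feat, text_embeds, labels, neutral_idx, temperature=0.07):
    """Assumed two-way contrastive form: the positive is the text embedding of
    the labelled expression, the negative is the derived neutral representation."""
    f = l2_normalize(face_feat)
    pos = l2_normalize(text_embeds[labels])
    neg = l2_normalize(to_neutral(face_feat, text_embeds, labels, neutral_idx))
    pos_sim = (f * pos).sum(-1) / temperature
    neg_sim = (f * neg).sum(-1) / temperature
    # -log(exp(pos) / (exp(pos) + exp(neg))), computed stably
    return float(np.mean(np.logaddexp(pos_sim, neg_sim) - pos_sim))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_embeds = rng.normal(size=(7, 16))  # 7 hypothetical classes, index 6 = neutral
    face_feat = rng.normal(size=(4, 16))    # stand-in for visual encoder outputs
    labels = np.array([0, 1, 2, 3])
    probs = softmax(similarity_logits(face_feat, text_embeds))
    print("predicted classes:", probs.argmax(-1))
    print("self-contrast loss:", self_contrast_loss(face_feat, text_embeds, labels, 6))
```

In a real pipeline the random arrays would be replaced by the visual encoder's outputs and by text embeddings from a frozen VLM text encoder, and the loss would be combined with the other terms of the total objective.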
