Enhancing Zero-Shot Facial Expression Recognition by LLM Knowledge Transfer

Zengqun Zhao, Yu Cao, Shaogang Gong, Ioannis Patras

TL;DR

Exp-CLIP is a novel method that enhances zero-shot facial expression recognition (FER) by transferring task knowledge from large language models (LLMs). It achieves better zero-shot results than the CLIP models and several other large vision-language models (LVLMs) on seven in-the-wild FER datasets.

Abstract

Current facial expression recognition (FER) models are often designed in a supervised learning manner and thus are constrained by the lack of large-scale facial expression images with high-quality annotations. Consequently, these models often fail to generalize well, performing poorly on unseen images in inference. Vision-language-based zero-shot models demonstrate a promising potential for addressing such challenges. However, these models lack task-specific knowledge and therefore are not optimized for the nuances of recognizing facial expressions. To bridge this gap, this work proposes a novel method, Exp-CLIP, to enhance zero-shot FER by transferring the task knowledge from large language models (LLMs). Specifically, based on the pre-trained vision-language encoders, we incorporate a projection head designed to map the initial joint vision-language space into a space that captures representations of facial actions. To train this projection head for subsequent zero-shot predictions, we propose to align the projected visual representations with task-specific semantic meanings derived from the LLM encoder, and the text instruction-based strategy is employed to customize the LLM knowledge. Given unlabelled facial data and efficient training of the projection head, Exp-CLIP achieves superior zero-shot results to the CLIP models and several other large vision-language models (LVLMs) on seven in-the-wild FER datasets. The code and pre-trained models are available at https://github.com/zengqunzhao/Exp-CLIP.
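
The alignment step described in the abstract can be illustrated with a minimal sketch. It assumes a symmetric contrastive (InfoNCE-style) objective, a single linear projection head `W`, and a temperature of 0.07; none of these specifics are stated in the abstract, and the function names are hypothetical. The inputs stand in for frozen image-encoder features and LLM-derived features of the same unlabelled faces. Plain Python is used only to keep the example dependency-free.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(W, v):
    """Apply a linear projection head W (one row per output dim) to v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the target of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def alignment_loss(image_feats, llm_feats, W, temperature=0.07):
    """Assumed symmetric contrastive loss between projected image
    features and LLM-derived features of the same samples."""
    img = [l2_normalize(project(W, f)) for f in image_feats]
    txt = [l2_normalize(f) for f in llm_feats]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy 2-D features: matched pairs should give a much lower loss.
identity = [[1.0, 0.0], [0.0, 1.0]]
faces = [[1.0, 0.0], [0.0, 1.0]]
matched = alignment_loss(faces, [[1.0, 0.0], [0.0, 1.0]], identity)
swapped = alignment_loss(faces, [[0.0, 1.0], [1.0, 0.0]], identity)
```

In this toy setup only `W` would be updated during training, which matches the abstract's point that the encoders are pre-trained and only the projection head is trained.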

Paper Structure (21 sections, 3 equations, 8 figures, 7 tables, 1 algorithm)

Figures (8)

  • Figure 1: The CLIP model learns general feature representations but lacks task-specific knowledge, and is therefore not optimized for the nuances of recognizing facial expressions.
  • Figure 2: The proposed framework comprises contrastive pre-training and zero-shot FER. In the testing phase, a learned projection head is employed to enhance image-text representations for facial expressions. This projection head is learned in an unsupervised manner, leveraging the knowledge from LLMs. The I2T module, which maps images into LLM tokens, consists of a ViT, a Q-Former (Li et al., 2023), and a projector.
  • Figure 3: CLIP latent visual feature distribution on the FERPlus dataset. Both results are based on ViT-L-14.
  • Figure A: Confusion matrices compared with the CLIP model on static FER datasets.
  • Figure B: Confusion matrices compared with the CLIP model on dynamic FER datasets.
  • ...and 3 more figures
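
The testing phase mentioned in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the learned projection head is a single linear map `W` applied to both the image feature and the class-prompt text features, and that the prediction is the class with the highest cosine similarity. The function names and the toy features are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(W, v):
    """Apply a linear projection head W (one row per output dim) to v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def zero_shot_predict(image_feat, class_text_feats, W):
    """Return the best-matching label and all cosine scores.

    image_feat: feature from a frozen image encoder (assumed input).
    class_text_feats: dict mapping each expression label to the text
        feature of its prompt (assumed input).
    W: learned projection head, assumed to apply to both modalities.
    """
    img = l2_normalize(project(W, image_feat))
    scores = {}
    for label, feat in class_text_feats.items():
        txt = l2_normalize(project(W, feat))
        scores[label] = sum(a * b for a, b in zip(img, txt))
    return max(scores, key=scores.get), scores

# Toy 2-D example with an identity projection head.
identity = [[1.0, 0.0], [0.0, 1.0]]
prompts = {"happy": [1.0, 0.0], "sad": [0.0, 1.0]}
label, scores = zero_shot_predict([0.9, 0.1], prompts, identity)
```

No labelled expression data is needed at this stage; the class set is defined entirely by the text prompts.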