Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting

Yuanyuan Liu; Yuxuan Huang; Shuyang Liu; Yibing Zhan; Zijing Chen; Zhe Chen

TL;DR

This work addresses open-set video-based facial expression recognition (OV-FER), in which a model must recognize known expressions and detect unseen ones in video data. It introduces Human Expression-Sensitive Prompting (HESP), a CLIP-based framework comprising textual prompting with learnable tokens, visual prompting with expression-sensitive attention and a CAM-based mask, and an open-set multi-task learning scheme that fosters cross-modal interaction. The approach uses negative representations and a combination of losses to push known classes apart while highlighting unknown patterns, yielding relative improvements over CLIP of 17.93% in AUROC and 106.18% in OSCR across four OV-FER task settings. Empirical results on AFEW and MAFW indicate that HESP generalizes well and is efficient, which suggests practical value for robust open-set video emotion recognition and possible extensions to multimodal open-set problems.
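
To make the open-set setting concrete, the following is a minimal sketch of a generic CLIP-style decision rule with rejection: a video feature is compared with one text feature per known class, and the sample is labeled "unknown" when the best class probability is too low. This is an illustrative baseline, not HESP itself. The function names, the temperature of 0.07, and the threshold are assumptions chosen for the example, and the feature vectors are toy values rather than real CLIP embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_open_set(video_feat, text_feats, class_names,
                     temperature=0.07, threshold=0.5):
    """Return (label, confidence); the label is "unknown" on rejection.

    temperature and threshold are hypothetical values for illustration.
    """
    # Similarity of the video feature to each known-class text feature.
    logits = [cosine(video_feat, t) / temperature for t in text_feats]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    # Reject as unknown when no known class is confident enough.
    if probs[best] < threshold:
        return "unknown", probs[best]
    return class_names[best], probs[best]


if __name__ == "__main__":
    names = ["happy", "sad"]
    text_feats = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for text embeddings
    print(predict_open_set([1.0, 0.0], text_feats, names))
    print(predict_open_set([1.0, 1.0], text_feats, names, threshold=0.6))
```

In this toy setup, a feature aligned with one class is accepted, while a feature equally similar to both classes falls below the threshold and is rejected as unknown.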

Abstract

In Video-based Facial Expression Recognition (V-FER), models are typically trained on closed-set datasets with a fixed number of known classes. However, these models struggle with unknown classes common in real-world scenarios. In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, aiming to identify both known and new, unseen facial expressions. While existing approaches use large-scale vision-language models like CLIP to identify unseen classes, we argue that these methods may not adequately capture the subtle human expressions needed for OV-FER. To address this limitation, we propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP's ability to model video-based facial expression details effectively. Our proposed HESP comprises three components: 1) a textual prompting module with learnable prompts to enhance CLIP's textual representation of both known and unknown emotions, 2) a visual prompting module that encodes temporal emotional information from video frames using expression-sensitive attention, equipping CLIP with a new visual modeling ability to extract emotion-rich information, and 3) an open-set multi-task learning scheme that promotes interaction between the textual and visual modules, improving the understanding of novel human emotions in video sequences. Extensive experiments conducted on four OV-FER task settings demonstrate that HESP can significantly boost CLIP's performance (a relative improvement of 17.93% on AUROC and 106.18% on OSCR) and outperform other state-of-the-art open-set video understanding methods by a large margin. Code is available at https://github.com/cosinehuang/HESP.
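
The reported gains are relative improvements, so a 106.18% gain means the score becomes roughly 2.06 times the baseline. The sketch below shows, under stated assumptions, how AUROC for known-versus-unknown detection can be computed as a pairwise ranking statistic and how a relative improvement is derived. The function names and all numeric inputs are hypothetical and are not taken from the paper's experiments.

```python
def auroc(known_scores, unknown_scores):
    """AUROC as the probability that a known sample outscores an unknown one.

    Scores are confidence values where higher means "more likely known".
    Ties count as half a win.
    """
    wins = 0.0
    for k in known_scores:
        for u in unknown_scores:
            if k > u:
                wins += 1.0
            elif k == u:
                wins += 0.5
    return wins / (len(known_scores) * len(unknown_scores))


def relative_improvement(baseline, improved):
    """Relative improvement in percent: 100 * (improved - baseline) / baseline."""
    return 100.0 * (improved - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical confidence scores, for illustration only.
    known = [0.9, 0.8, 0.4]
    unknown = [0.5, 0.3, 0.2]
    print(auroc(known, unknown))
    # Hypothetical metric values: a move from 50.0 to 60.0 is a 20% relative gain.
    print(relative_improvement(50.0, 60.0))
```

The pairwise form is quadratic in the number of samples, which is fine for a small illustration; a rank-based implementation would be preferable for large evaluation sets.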

Paper Structure (36 sections, 14 equations, 5 figures, 7 tables)

Introduction
Related work
Method
Preliminary
Problem Definition
Negative Representations for Open-set Problems
CLIP Model
Overview of HESP
Textual Prompting
Textual Prompting for Closed-set Data
Negative Textual Prompting for Open-set Data
Visual Prompting
Visual Prompting for Closed-set Data
Negative Visual Prompting for Open-set Data
Optimization: Open-set Multi-task Learning
...and 21 more sections

Figures (5)

Figure 1: The motivation and intuitive results of our HESP for OV-FER. Current methods struggle to capture effective expression information in OV-FER because emotion changes within a video and differences between videos are subtle. To address this limitation, HESP combines a novel learnable textual prompting module, a visual prompting module, and an open-set multi-task learning scheme, augmenting CLIP to obtain more expression-sensitive representations for both known and unknown emotions in OV-FER.
Figure 2: The pipeline of HESP for OV-FER. HESP first combines textual and visual prompting modules to enhance CLIP for modelling facial expression-sensitive information. Then, an open-set multi-task learning scheme is devised to facilitate interactions between these modules, improving OV-FER performance by exploring both known and unknown emotion cues.
Figure 3: Various settings in textual and visual prompting.
Figure 4: Known and unknown probability distributions.
Figure 5: Visualization of facial expression features extracted by different methods. Owing to space limitations, we only present the results on the 7-basic-emotion OV-FER task under the openness O (3:4). More visualization results are provided in the supplementary material.
