OUS: Scene-Guided Dynamic Facial Expression Recognition

Xinji Mai; Haoran Wang; Zeng Tao; Junxiong Lin; Shaoqi Yan; Yan Wang; Jing Liu; Jiawen Yu; Xuan Tong; Yating Li; Wenqiang Zhang

TL;DR

The paper addresses a key gap in dynamic facial expression recognition (DFER): existing methods discard scene context, which creates a mismatch between how models and human annotators perceive emotion (the Rigid Cognitive Problem). It proposes Overall Understanding of the Scene (OUS), a scene-aware DFER framework that fuses facial and environmental cues using a Type Fusion Encoder and a triad of losses: $L_{sim}$, $L_{pol}$, and $L_{cont}$, with a conditional $L_{global}$. Learnable prompts replace fixed prompts for cross-modal alignment, so the contrastive loss is computed against scene-aware representations. Experiments on the DFEW and FERV39k datasets demonstrate state-of-the-art performance and faster convergence, supporting the use of scene polarity and cross-type fusion to mirror human contextual perception in emotion recognition.

Abstract

Dynamic Facial Expression Recognition (DFER) is crucial for affective computing but often overlooks the impact of scene context. We have identified a significant issue in current DFER tasks: human annotators typically integrate emotions from various angles, including environmental cues and body language, whereas existing DFER methods tend to consider the scene as noise that needs to be filtered out, focusing solely on facial information. We refer to this as the Rigid Cognitive Problem. The Rigid Cognitive Problem can lead to discrepancies between the cognition of annotators and models in some samples. To align more closely with the human cognitive paradigm of emotions, we propose an Overall Understanding of the Scene DFER method (OUS). OUS effectively integrates scene and facial features, combining scene-specific emotional knowledge for DFER. Extensive experiments on the two largest datasets in the DFER field, DFEW and FERV39k, demonstrate that OUS significantly outperforms existing methods. By analyzing the Rigid Cognitive Problem, OUS successfully understands the complex relationship between scene context and emotional expression, closely aligning with human emotional understanding in real-world scenarios.

Paper Structure (16 sections, 15 equations, 6 figures, 6 tables)

Introduction
Related Work
DFER Methods
Scene Information in DFER Datasets
Methodology
Overview of Training Strategy for Different Losses
Type Fusion Encoder
Prompt Engineering
Other Details of Network Architecture
Experiment
Training Details
Performance Evaluation
Ablation Study
Discussion
Conclusion
...and 1 more sections

Figures (6)

Figure 1: Human annotators instinctively combine scene information and faces when labeling facial emotions, whereas existing DFER methods use only face information for emotion prediction.
Figure 2: Scene polarity helps determine mood. The left side shows facial expressions. Looking at the face alone, the upper and lower expressions could each be read as surprise, happiness, fear, sadness, etc., but once the polarity of the scene is injected, they can be judged to be two completely different expressions (happiness and sadness).
Figure 3: OUS Overall Architecture Diagram. OUS employs a dual-stream structure, separating images into facial and scene images through preprocessing. These images are encoded into latent spaces $V_f$ and $V_s$ using a shared Vision Encoder $f_V(\cdot)$. Temporal relationships are initially fused via LSTM. Facial features $V_f$ are further processed through a Frame Encoder, while scene features $V_s$, having less temporal variation, are processed through Frames Mean. Similarity Loss $L_{similarity}$ aligns the latent spaces of facial and scene features, reducing the distance between them using convolutional and fully connected layers. We also consider that the first four encoding blocks capture information like texture, color, and atmosphere, which are abstractly related to emotions. Polarity Loss constrains the Polarity Encoder to extract this polarity information, which is then concatenated with $V_{ft}$ and $V_{st}$ and fed into the Type Fusion Encoder. This mechanism uses Learnable Queries ($Q_{learn}$) as the Query, with $V_{ft}$ and $V_{st}$ as the Key and Value. Learnable Queries, initialized with a Gaussian distribution, are shared across all cross-type attention blocks and refined during training to capture emotional entities in the features. Finally, OUS incorporates substantial scene information, making fixed prompt methods unsuitable. Instead, it uses updatable prompts to compute contrastive loss with the output features.
Figure 4: Changes in Feature Clustering During the Inference Process. Global layout visualization of the feature space. In the legend, 0 to 6 represent happiness, sadness, neutral, anger, surprise, disgust, fear, respectively.
Figure 5: Confusion Matrices on the DFEW and FERV39k Datasets.
...and 1 more figures
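
The Figure 3 caption describes cross-type attention in which Learnable Queries $Q_{learn}$ act as the Query and the facial and scene features $V_{ft}$ and $V_{st}$ act as Key and Value. The following is a minimal sketch of that idea only, not the authors' implementation: it assumes single-head scaled dot-product attention with no projection matrices, residual connections, or polarity features, and the names `gaussian_matrix`, `cross_type_attention`, `DIM`, and `NUM_QUERIES`, as well as all sizes and the random stand-in features, are hypothetical.

```python
import math
import random


def gaussian_matrix(rows, cols, std=0.02, rng=None):
    """Return a rows x cols matrix of Gaussian samples (hypothetical helper)."""
    rng = rng or random.Random(0)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_type_attention(queries, tokens):
    """Single-head scaled dot-product attention.

    `queries` play the role of Q_learn; `tokens` serve as both Key and Value.
    Learned projections are omitted here (an assumption made for brevity).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        w = softmax(scores)
        out = [sum(w_j * tok[d] for w_j, tok in zip(w, tokens)) for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4  # illustrative sizes, not taken from the paper

# Gaussian-initialized queries; in training these would be updated and
# the same matrix would be reused by every cross-type attention block.
q_learn = gaussian_matrix(NUM_QUERIES, DIM, rng=rng)

# Random stand-ins for encoder outputs: six facial frame tokens and one
# scene token (the scene stream is averaged over frames in the caption).
v_ft = gaussian_matrix(6, DIM, std=1.0, rng=rng)
v_st = gaussian_matrix(1, DIM, std=1.0, rng=rng)

tokens = v_ft + v_st  # concatenate along the token axis
fused, weights = cross_type_attention(q_learn, tokens)
```

Each row of `fused` is a weighted mixture of facial and scene tokens, so a single query can draw on both types at once; the last column of `weights` shows how much attention each query places on the scene token.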
