A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition

Jiajun Sun, Zhe Gao

Abstract

This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.
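The Stage II feature path described above (average the multi-scale visual features, then combine them with the audio feature through a gate) can be illustrated with a minimal sketch. The abstract does not specify the gate, so the element-wise sigmoid form below, the function names, and the assumption that both modalities have already been projected to the same dimension are all hypothetical and not the authors' implementation.

```python
import math


def mean_pool(features):
    """Average several equal-length feature vectors (e.g. one per crop scale)."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, audio, w_gate, b_gate):
    """Assumed gate: g = sigmoid(W [v; a] + b), fused = g * v + (1 - g) * a.

    visual, audio: vectors of the same length d (assumed already projected).
    w_gate: d rows of length 2d; b_gate: d biases. Both are hypothetical parameters.
    """
    concat = list(visual) + list(audio)
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(w_gate, b_gate)
    ]
    # Each gate value decides how much of the visual vs. audio feature passes through.
    return [g * v + (1.0 - g) * a for g, v, a in zip(gate, visual, audio)]


# Toy usage: three crop scales, 2-dimensional features, an untrained (all-zero) gate.
visual = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
audio = [1.0, 0.0]
fused = gated_fusion(visual, audio, [[0.0] * 4, [0.0] * 4], [0.0, 0.0])
```

With an all-zero gate every gate value is 0.5, so the fused vector is simply the midpoint of the two modalities; a trained gate would instead weight each dimension according to the input.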

Paper Structure (30 sections, 18 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Overview of the proposed model. For each target frame, three facial crops at different scales are extracted from the raw video and encoded by a DINOv2 backbone that has been pre-adapted in Stage I (see Fig. 2). The resulting multi-scale visual representation is fused with a frame-aligned Wav2Vec 2.0 audio representation through a gated fusion module for frame-level expression recognition. During inference, lightweight temporal smoothing is further applied to improve prediction stability and temporal consistency.
  • Figure 2: Stage I adaptation of the DINOv2-based encoder using AffectNet and RAF-DB. PadAug is used to simulate boundary padding artifacts caused by large facial crops, while a training-only MoE task head provides sample-dependent expert routing to enhance DINOv2 adaptation. After Stage I, the MoE head is discarded and only the adapted DINOv2 backbone is retained (see Fig. 1).
  • Figure 3: Illustration of the proposed padding-aware augmentation (PadAug). The top row shows original facial crops, while the bottom row presents augmented samples where artificial padding is applied to different image boundaries to simulate boundary artifacts caused by imperfect face cropping in raw videos.
  • Figure 4: Macro-averaged $F_1$ score under different temporal smoothing strategies and window sizes.
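
Two of the ideas in the captions above, PadAug (Figure 3) and inference-time temporal smoothing (Figure 4), can be sketched in a few lines. This is only an illustration under stated assumptions: the captions do not give the padding ratio, fill value, or smoothing strategy, so the single-border constant fill, the centered moving average, and every name and default value below are hypothetical.

```python
import random


def pad_aug(image, max_frac=0.3, fill=0, rng=random):
    """Overwrite a band along one randomly chosen border with a constant value.

    image: list of rows, each a list of pixel values. Returns a new image of
    the same size. max_frac and fill are assumed values, not from the paper.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    side = rng.choice(["top", "bottom", "left", "right"])
    frac = rng.uniform(0.0, max_frac)
    if side in ("top", "bottom"):
        k = int(h * frac)
        rows = range(k) if side == "top" else range(h - k, h)
        for r in rows:
            out[r] = [fill] * w
    else:
        k = int(w * frac)
        cols = range(k) if side == "left" else range(w - k, w)
        for r in range(h):
            for c in cols:
                out[r][c] = fill
    return out


def smooth_probs(probs, window=5):
    """Centered moving average over per-frame class probabilities.

    The window is truncated at the start and end of the sequence.
    """
    half = window // 2
    n = len(probs)
    smoothed = []
    for t in range(n):
        segment = probs[max(0, t - half):min(n, t + half + 1)]
        smoothed.append([sum(col) / len(segment) for col in zip(*segment)])
    return smoothed


def predict(probs, window=5):
    """Per-frame argmax after smoothing."""
    return [max(range(len(p)), key=p.__getitem__) for p in smooth_probs(probs, window)]


# Toy usage: a single-frame flicker to class 1 is removed by a window of 3.
frames = [[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1], [0.9, 0.1]]
labels = predict(frames, window=3)
```

Setting `window=1` disables smoothing and returns the raw per-frame argmax, which makes it easy to compare window sizes in the way Figure 4 does.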