Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial Expressions Using Static Expression Data

Yin Chen, Jia Li, Yu Zhang, Zhenzhen Hu, Shiguang Shan, Meng Wang, Richang Hong

TL;DR

This work introduces S4D, a unified dual-modal learning framework that uses static facial expression recognition (SFER) data to improve dynamic facial expression recognition (DFER). It combines dual-modal self-supervised pre-training with joint fine-tuning on SFER and DFER data, and integrates a Mixture of Adapter Experts (MoAE) to mitigate negative transfer between the two tasks. In extensive experiments on DFEW, FERV39K, and MAFW, S4D achieves state-of-the-art performance, with significant improvements in both unweighted average recall (UAR) and weighted average recall (WAR); correlation analyses between SFER and DFER accompany these results. The results highlight the value of cross-modal, cross-task collaboration for affective computing and offer a scalable approach to leveraging abundant static data in dynamic contexts.

Abstract

Dynamic facial expression recognition (DFER) infers emotions from the temporal evolution of expressions, unlike static facial expression recognition (SFER), which relies solely on a single snapshot. This temporal analysis provides richer information and promises greater recognition capability. However, current DFER methods often exhibit unsatisfactory performance, largely because fewer training samples are available than for SFER. Given the inherent correlation between static and dynamic expressions, we hypothesize that leveraging the abundant SFER data can enhance DFER. To this end, we propose Static-for-Dynamic (S4D), a unified dual-modal learning framework that integrates SFER data as a complementary resource for DFER. Specifically, S4D employs dual-modal self-supervised pre-training on facial images and videos using a shared Vision Transformer (ViT) encoder-decoder architecture, yielding improved spatiotemporal representations. The pre-trained encoder is then fine-tuned on static and dynamic expression datasets in a multi-task learning setup to facilitate emotional information interaction. Unfortunately, vanilla multi-task learning in our study results in negative transfer. To address this, we propose an innovative Mixture of Adapter Experts (MoAE) module that facilitates task-specific knowledge acquisition while effectively extracting shared knowledge from both static and dynamic expression data. Extensive experiments demonstrate that S4D achieves a deeper understanding of DFER, setting new state-of-the-art performance on the FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65%, 58.44%, and 76.68%, respectively. Additionally, a systematic correlation analysis between SFER and DFER tasks is presented, which further elucidates the potential benefits of leveraging SFER.
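
The two metrics reported for S4D can be illustrated with a short sketch. It uses the standard definitions: UAR is the unweighted mean of per-class recall, and WAR weights each class's recall by its sample count, which makes it equal to overall accuracy. The function name `uar_war` and the toy labels are illustrative assumptions, not taken from the paper or its code.

```python
from collections import Counter


def uar_war(y_true, y_pred):
    """Return (UAR, WAR) as fractions in [0, 1].

    UAR: mean of per-class recall, every class counted equally.
    WAR: per-class recall weighted by class size (equals accuracy).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    support = Counter(y_true)  # number of samples per ground-truth class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    recalls = [hits[c] / support[c] for c in support]
    uar = sum(recalls) / len(recalls)
    war = sum(hits.values()) / len(y_true)
    return uar, war


# Toy, imbalanced example (hypothetical labels): 8 "happy" clips, 2 "fear" clips.
y_true = ["happy"] * 8 + ["fear"] * 2
y_pred = ["happy"] * 8 + ["happy", "fear"]
uar, war = uar_war(y_true, y_pred)
print(f"UAR={uar:.2%}  WAR={war:.2%}")
```

In this toy case the minority class is recognized only half the time, so UAR (75%) falls well below WAR (90%). This gap is why imbalanced DFER benchmarks are usually reported with both numbers.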

Paper Structure (32 sections, 13 equations, 7 figures, 12 tables)

Figures (7)

  • Figure 1: Performance comparison between previous SOTA methods [zhao2023prompting, sun2023mae, chen2023static, tong2022videomae] and our proposed S4D on the FERV39K [ferv39k2022], MAFW [MAFW], and DFEW [DFEW] datasets. Unweighted average recall (UAR, %) and weighted average recall (WAR, %) are reported. S4D, which incorporates static expression knowledge through a unified dual-modal learning framework, consistently outperforms the baseline method, VideoMAE [tong2022videomae], previously pre-trained on VoxCeleb2 [chung2018voxceleb2], across all these real-world DFER datasets.
  • Figure 2: Overview of our proposed S4D framework. We utilize Vision Transformer (ViT) [dosovitskiy2020image] as the backbone and pre-train it on facial image and video datasets using Masked Autoencoders [feichtenhofer2022masked]. The pre-trained ViT encoder is then used to initialize the S4D encoder, which is further fine-tuned on static and dynamic FER datasets. The proposed Mixture of Adapter Experts (MoAE) module is integrated into the ViT layers to create MoAE layers during joint fine-tuning. MLP$_I$ and MLP$_V$ denote the classification heads for SFER and DFER, while FFN, Norm, and MHSA represent the feed-forward network, layer normalization, and multi-head self-attention mechanisms, respectively.
  • Figure 3: Analyses of the total number of experts, the number of MoAE layers, and the proportion of SFER data used during dual-modal pre-training and joint fine-tuning.
  • Figure 4: The semantic relevance between SFER and DFER tasks. NE, HA, SA, AN, SU, DI, and FE denote neutral, happy, sad, anger, surprise, disgust, and fear, respectively.
  • Figure 5: Visualization of activation pathways. The figure shows the top-10 most frequently activated expert paths, where the top-2 are highlighted in color and the others in gray. Note that DFER and SFER tasks employ hard sharing through the FFN branch, enabling structural coupling while maintaining task-specific routing.
  • ...and 2 more figures
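
The MoAE layer named in the Figure 2 caption can be pictured with a minimal NumPy sketch. Everything beyond "adapter experts added to a ViT layer whose FFN is shared" is an assumption made for illustration: the bottleneck adapter design, the linear softmax router, top-k selection, the parallel placement next to the FFN, and all names (`Adapter`, `MoAE`, `moae_ffn_block`) and sizes are hypothetical and may differ from the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class Adapter:
    """Assumed bottleneck adapter: down-project, GELU, up-project."""

    def __init__(self, dim, bottleneck):
        self.down = rng.normal(0.0, 0.02, (dim, bottleneck))
        self.up = np.zeros((bottleneck, dim))  # zero init: adapter starts as a no-op

    def __call__(self, x):
        return gelu(x @ self.down) @ self.up


class MoAE:
    """Assumed mixture of adapter experts with per-token top-k routing."""

    def __init__(self, dim, bottleneck, num_experts, top_k):
        self.experts = [Adapter(dim, bottleneck) for _ in range(num_experts)]
        self.router = rng.normal(0.0, 0.02, (dim, num_experts))
        self.top_k = top_k

    def __call__(self, x):  # x: (tokens, dim)
        logits = x @ self.router  # (tokens, num_experts)
        top = np.argsort(-logits, axis=1)[:, : self.top_k]  # chosen expert ids
        picked = np.take_along_axis(logits, top, axis=1)
        picked = picked - picked.max(axis=1, keepdims=True)  # numerical stability
        gates = np.exp(picked)
        gates /= gates.sum(axis=1, keepdims=True)  # softmax over chosen experts
        out = np.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot][:, None] * expert(x[mask])
        return out, top


def moae_ffn_block(x, ffn, moae):
    """Residual block: shared FFN plus the routed adapter output (assumed layout)."""
    delta, routes = moae(x)
    return x + ffn(x) + delta, routes


# Hypothetical sizes; `ffn` is a stand-in for the ViT feed-forward network.
tokens = rng.normal(size=(6, 16))
moae = MoAE(dim=16, bottleneck=4, num_experts=4, top_k=2)
ffn = lambda h: np.tanh(h)
y, routes = moae_ffn_block(tokens, ffn, moae)
print(y.shape, routes.shape)
```

The `routes` array records which experts each token selected. Counting such selections across layers is one plausible way to obtain activation-path statistics of the kind visualized in Figure 5.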