From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos

Yin Chen, Jia Li, Shiguang Shan, Meng Wang, Richang Hong

TL;DR

This work tackles dynamic facial expression recognition in the wild by leveraging abundant static FER data and facial landmark cues via a parameter-efficient transfer framework named S2D. S2D extends a pre-trained Vision Transformer with Multi-View Complementary Prompters (MCP) for landmark-guided image representations and Temporal-Modeling Adapters (TMA) to capture temporal dynamics, while keeping most parameters frozen. An Emotion-Anchors Self-Distillation Loss (SDL) further mitigates label ambiguity by using reference emotion samples to provide soft supervision. Across SFER and DFER benchmarks, S2D achieves state-of-the-art or competitive results with markedly fewer tunable parameters, demonstrating the practicality and effectiveness of transferring static FER knowledge to dynamic settings. The approach offers a simple, scalable baseline for efficient video FER that can benefit real-world HCI, healthcare, and safety applications.

Abstract

Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations, e.g., insufficient quantity and diversity of pose, occlusion, and illumination, as well as the inherent ambiguity of facial expressions. In contrast, static facial expression recognition (SFER) currently shows much higher performance and can benefit from more abundant high-quality training data. Moreover, the appearance features and dynamic dependencies of DFER remain largely unexplored. To tackle these challenges, we introduce a novel Static-to-Dynamic model (S2D) that leverages existing SFER knowledge and dynamic information implicitly encoded in extracted facial landmark-aware features, thereby significantly improving DFER performance. First, we build and train an image model for SFER, which consists only of a standard Vision Transformer (ViT) and Multi-View Complementary Prompters (MCPs). Then, we obtain our video model (i.e., S2D) for DFER by inserting Temporal-Modeling Adapters (TMAs) into the image model. MCPs enhance facial expression features with landmark-aware features inferred by an off-the-shelf facial landmark detector, while TMAs capture and model the relationships of dynamic changes in facial expressions, effectively extending the pre-trained image model to videos. Notably, MCPs and TMAs add only a small fraction of trainable parameters (less than 10%) to the original image model. Moreover, we present a novel Self-Distillation Loss based on Emotion-Anchors (i.e., reference samples for each emotion category) to reduce the detrimental influence of ambiguous emotion labels, further enhancing S2D. Experiments conducted on popular SFER and DFER datasets show that we achieve the state of the art.
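To make the parameter-efficiency claim concrete, the following is a back-of-the-envelope sketch of how the trainable fraction could be estimated when small bottleneck modules are added to a frozen transformer. Everything numeric here is an assumption for illustration: a ViT-B-like width of 768, depth of 12, a bottleneck width of 64, and two added modules per block. These are not the configuration reported for S2D, and the helper names are hypothetical. The count also ignores patch embeddings, position embeddings, and the classification head.

```python
def linear_params(d_in, d_out):
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out


def layernorm_params(d):
    """Scale and shift vectors of one LayerNorm."""
    return 2 * d


def vit_block_params(d, mlp_ratio=4):
    """Parameters of one standard transformer encoder block (assumed layout)."""
    attn = 4 * linear_params(d, d)  # query, key, value, output projections
    mlp = linear_params(d, mlp_ratio * d) + linear_params(mlp_ratio * d, d)
    norms = 2 * layernorm_params(d)
    return attn + mlp + norms


def bottleneck_adapter_params(d, r):
    """Down-projection to width r followed by up-projection back to d."""
    return linear_params(d, r) + linear_params(r, d)


def trainable_fraction(d=768, depth=12, r=64, adapters_per_block=2):
    """Share of parameters that are trainable when only the adapters are tuned.

    All default values are illustrative assumptions, not S2D's settings.
    """
    frozen = depth * vit_block_params(d)
    trainable = depth * adapters_per_block * bottleneck_adapter_params(d, r)
    return trainable / (frozen + trainable)


if __name__ == "__main__":
    print(f"params per encoder block: {vit_block_params(768):,}")
    print(f"params per adapter:       {bottleneck_adapter_params(768, 64):,}")
    print(f"trainable fraction:       {trainable_fraction():.2%}")
```

Under these assumed sizes the added modules stay well below the 10% budget quoted in the abstract; widening the bottleneck `r` or adding more modules per block raises the fraction proportionally.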

Paper Structure (32 sections, 21 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Performance comparison of dynamic facial expression recognition on the DFEW [jiang2020dfew] testing set. Bubble size indicates model size. Our proposed S2D achieves the highest weighted average recall (WAR) with significantly fewer tunable parameters ($<10\%$ of the whole model). We compare S2D with C3D [Tran2014LearningSF], R(2+1)D-18 [tran2018closer], 3D ResNet-18 [He2015DeepRL], Former-DFER [Zhao2021FormerDFERDF], CEFLNet [liu2022clip], EST [liu2023expression], IAL [li2023intensity], CLIPER [Li2023CLIPERAU], DFER-CLIP [zhao2023prompting], and MAE-DFER [sun2023mae].
  • Figure 2: Overall architecture of the proposed method. S2D takes as input a facial expression image (or image sequence) $\bm{X}_F$ and a landmark-aware feature (or landmark-aware feature sequence) $\bm{X}_L$. Both are embedded with patch embedding layers and fed into the transformer layers $\{E^l\}^{L-1}_{l=0}$ borrowed from ViT. The Temporal-Modeling Adapter (TMA) captures temporal information $\bm{\mathcal{T}}^l$, while the Multi-View Complementary Prompter (MCP) uses landmark-aware features to generate guiding prompts $\bm{\mathcal{P}}^l$ that enhance the image-level representational ability for both SFER and DFER tasks. Note that the position embedding is added to $\bm{\mathcal{P}}^0$ after the first MCP block, and TMA is only used for the DFER task. Sg denotes stop-gradient.
  • Figure 3: Temporal-Modeling Adapter (TMA) for temporal adaptation. The input $\bm{\mathcal{H}}^l \in \mathbb{R}^{T \times N \times D}$ is fed into a Temporal Adapter to capture temporal information, and the result is then passed through a LayerNorm and a Vanilla Adapter to reduce the domain gap between SFER and DFER. $\bm{\mathcal{T}}^{l+1} \in \mathbb{R}^{T \times N \times D}$ is the learned temporal information.
  • Figure 4: Emotion-Anchors based Self-Distillation Loss. $\bm{p}$ denotes the output probabilities of S2D for the emotion anchors. $\sigma$ is the similarity score between the input and the emotion anchors. ($\bm{X}_i$, $\bm{Y}_i$) is a sample from the current batch and $\bm{Y}^{pre}_i$ is the corresponding predicted probability. $Y^{soft}_i$ is the soft label produced for $\bm{X}_i$.
  • Figure 5: Class-level comparison of our proposed model with the baseline. We visualize the overall accuracy (WAR) and the class accuracy of each emotion on the DFEW (fd1) and FERV39K datasets. The baseline method is our model without TMA, MCP, and SDL (i.e., line 1 in Table \ref{tab:ablation_dfer}).
  • ...and 3 more figures
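
The Figure 4 caption names the ingredients of the Emotion-Anchors based Self-Distillation Loss (anchor probabilities $\bm{p}$, similarity scores $\sigma$, and a soft label $Y^{soft}_i$) but does not give the formula. The sketch below shows one plausible way such pieces could fit together, purely as an illustration: it assumes cosine similarity for $\sigma$, a temperature softmax over the similarities, a soft label formed as the similarity-weighted average of anchor probabilities, and a soft cross-entropy as the loss. None of these choices is confirmed by the text above, and all function names and the temperature value are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed form of sigma)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_label(feature, anchor_features, anchor_probs, temperature=0.1):
    """Similarity-weighted average of the anchors' predicted distributions.

    Assumption: anchors that look more like the input contribute more.
    """
    sigma = [cosine(feature, a) for a in anchor_features]
    weights = softmax(sigma, temperature)
    num_classes = len(anchor_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, anchor_probs))
        for c in range(num_classes)
    ]


def soft_cross_entropy(predicted, target, eps=1e-12):
    """Cross-entropy of a predicted distribution against a soft target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


if __name__ == "__main__":
    # Toy data: three anchors in a 2-D feature space, three emotion classes.
    anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    anchor_probs = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
    x_feature = [1.0, 0.1]           # closest to the first anchor
    y_pre = [0.6, 0.3, 0.1]          # model prediction for the sample

    y_soft = soft_label(x_feature, anchors, anchor_probs)
    print("soft label:", [round(v, 3) for v in y_soft])
    print("loss:", soft_cross_entropy(y_pre, y_soft))
```

In this toy setup the soft label is dominated by the anchor nearest to the input, so an ambiguous hard label would be softened toward what the model already predicts for similar reference samples. A lower temperature sharpens that weighting and a higher one spreads it across anchors.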