Enhancing Facial Expression Recognition through Dual-Direction Attention Mixed Feature Networks and CLIP: Application to 8th ABAW Challenge

Josep Cabacas-Maso, Elena Ortega-Beltrán, Ismael Benito-Altamirano, Carles Ventura

TL;DR

This work tackles in-the-wild facial affect analysis across three tasks: Valence-Arousal (VA) estimation, Expression (EXPR) recognition, and Action Unit (AU) detection. It adapts the Dual-Direction Attention Mixed Feature Network (DDAMFN) to all three tasks and additionally explores CLIP-based embeddings for emotion recognition. The DDAMFN pipeline combines a MobileFaceNet-based feature extractor, a Dual-Direction Attention (DDA) module, and a Global Depthwise Convolution (GDConv) layer, followed by a task-specific head; the CLIP pathway pairs vision and language encoders and is trained with a contrastive loss. Training starts from AffectNet-8 pretrained weights, adds task-specific classifiers, uses LSTMs to capture temporal dynamics, and optimizes the AU decision thresholds. The DDAMFN+LSTM variant achieves the best overall performance ($CCC_{VA}=0.479$, $F1_{AU}=0.411$, $F1_{AUopt}=0.451$), while the CLIP-based approaches notably improve expression recognition (e.g., $F1_{Expr}=0.336$ for CLIP+LSTM). These findings highlight the importance of temporal modeling and multimodal embeddings for robust affective analysis in the ABAW challenges, with practical impact in HCI, psychology, and clinical monitoring.
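
The metrics quoted above can be made concrete with a small sketch. The code below is not the authors' evaluation code: it uses the standard definition of the Concordance Correlation Coefficient (CCC) and a simple per-AU grid search as one plausible reading of "AU-threshold optimization". The function names (`ccc`, `f1_binary`, `best_thresholds`) and the threshold grid are hypothetical choices made for illustration.

```python
import numpy as np


def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient (standard definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    # Penalises both low correlation and a shift/scale mismatch.
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


def f1_binary(y_true, y_pred):
    """F1 score for one binary label column (0/1 arrays)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


def best_thresholds(labels, probs, grid=None):
    """Assumed procedure: for each AU, pick the grid threshold with the
    highest F1 on a held-out set, instead of a fixed 0.5 cut-off."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # hypothetical search grid
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=float)
    chosen = []
    for k in range(labels.shape[1]):
        scores = [f1_binary(labels[:, k], (probs[:, k] >= t).astype(int))
                  for t in grid]
        chosen.append(float(grid[int(np.argmax(scores))]))
    return np.array(chosen)


if __name__ == "__main__":
    # Toy data: one AU, four frames.
    labels = np.array([[0], [0], [1], [1]])
    probs = np.array([[0.10], [0.32], [0.38], [0.80]])
    print("CCC:", ccc([1, 2, 3], [2, 3, 4]))
    print("thresholds:", best_thresholds(labels, probs))
```

In this toy example the fixed 0.5 cut-off misses the positive frame scored 0.38, whereas the searched threshold separates the two classes. This is the kind of gap that could separate a fixed-threshold AU score from an optimized one, although the procedure the authors actually used may differ.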

Abstract

We present our contribution to the 8th ABAW challenge at CVPR 2025, where we tackle valence-arousal estimation, emotion recognition, and facial action unit detection as three independent challenges. Our approach leverages the well-known Dual-Direction Attention Mixed Feature Network (DDAMFN) for all three tasks, achieving results that surpass the proposed baselines. Additionally, we explore the use of CLIP for the emotion recognition challenge as an additional experiment. We provide insights into the architectural choices that contribute to the strong performance of our methods.

Paper Structure

This paper contains 11 sections, 6 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Our DDAMFN [zhang2023ddamfn] architecture for the 8th ABAW challenge: MobileFaceNet (MFN) for feature extraction (grey), Dual-Direction Attention (DDA) module (green), Global Depthwise Convolution (GDConv) layer (red), and one fully-connected layer (yellow) for the specific task being addressed (valence-arousal prediction, emotion recognition, or action unit detection).
  • Figure 2: CLIP architecture overview. The visual path (orange) processes the image input, and the text path (green) processes the textual input. The white areas are the fully connected layers that bridge both paths. The similarity outputs (purple) show how the model scores the relationship between the image and text inputs.
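
To illustrate the image-text similarity step described in the Figure 2 caption, here is a minimal NumPy sketch. It assumes the image and text embeddings have already been produced by the two paths, and it uses the symmetric cross-entropy objective from the original CLIP formulation; whether the paper's contrastive loss takes exactly this form is an assumption. The function names and the temperature value of 0.07 are illustrative, not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def clip_logits(image_emb, text_emb, temperature=0.07):
    """Pairwise cosine similarities between images (rows) and texts (columns),
    divided by a temperature (0.07 is an assumed value)."""
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_emb, dtype=float))
    return img @ txt.T / temperature


def symmetric_contrastive_loss(logits):
    """CLIP-style loss: the i-th image should match the i-th text and vice versa."""
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Toy embeddings: three images and three perfectly aligned texts.
    emb = np.eye(3)
    aligned = clip_logits(emb, emb)
    shuffled = clip_logits(emb, emb[[1, 2, 0]])
    print("aligned loss:", symmetric_contrastive_loss(aligned))
    print("shuffled loss:", symmetric_contrastive_loss(shuffled))
    # Inference: pick the text prompt with the highest similarity per image.
    print("predicted prompts:", aligned.argmax(axis=1))
```

For expression recognition, the text side would typically hold one embedded prompt per emotion class, so the row-wise `argmax` over the similarity matrix gives the predicted expression for each face image.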