Enhancing Facial Expression Recognition through Dual-Direction Attention Mixed Feature Networks and CLIP: Application to 8th ABAW Challenge
Josep Cabacas-Maso, Elena Ortega-Beltrán, Ismael Benito-Altamirano, Carles Ventura
TL;DR
This work tackles in-the-wild facial affect analysis across Valence-Arousal (VA) estimation, Expression (EXPR) recognition, and Action Unit (AU) detection by adapting the Dual-Direction Attention Mixed Feature Network (DDAMFN) to all three tasks and augmenting emotion recognition with CLIP-based embeddings. The architecture combines a MobileFaceNet-based feature extractor, a Dual-Direction Attention (DDA) module, and a Global Depthwise Convolution (GDConv) layer with task-specific heads, alongside a CLIP vision-language pathway trained with a contrastive loss. Training starts from AffectNet-8 pretrained weights, adds task-specific classifiers, uses LSTMs to model temporal dynamics, and optimizes the AU thresholds. The DDAMFN+LSTM variant achieves the best overall performance ($CCC_{VA}=0.479$, $F1_{AU}=0.411$, $F1_{AUopt}=0.451$), while CLIP-based approaches notably improve expression recognition (e.g., $F1_{Expr}=0.336$ for CLIP+LSTM). These findings highlight the importance of temporal modeling and multimodal embeddings for robust affective analysis in the ABAW challenges, with potential applications in human-computer interaction (HCI), psychology, and clinical monitoring.
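To make the reported metrics concrete, the following is a minimal sketch of how a Concordance Correlation Coefficient (CCC) and a thresholded F1 score could be computed. It is not the authors' evaluation code. The function names (`ccc`, `f1_binary`, `best_threshold`), the use of population variances, and the grid search over candidate thresholds are illustrative assumptions; the summary above does not state how the AU thresholds were actually optimized.

```python
from statistics import fmean


def ccc(pred, target):
    """Concordance Correlation Coefficient of two equal-length sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal length")
    mean_p, mean_t = fmean(pred), fmean(target)
    # Population (biased) variances and covariance.
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_t = fmean([(t - mean_t) ** 2 for t in target])
    cov = fmean([(p - mean_p) * (t - mean_t) for p, t in zip(pred, target)])
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    # Assumed convention: two identical constant sequences agree perfectly.
    return 1.0 if denom == 0 else 2.0 * cov / denom


def f1_binary(pred_labels, true_labels):
    """F1 score for binary labels (1 = AU active, 0 = inactive)."""
    pairs = list(zip(pred_labels, true_labels))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def best_threshold(scores, true_labels, candidates):
    """Hypothetical grid search: pick the threshold that maximizes F1
    for a single AU on held-out data (ties keep the first candidate)."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_binary(preds, true_labels)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Under these assumptions, `ccc` would be applied separately to valence and arousal, and `best_threshold` would be run once per AU on validation scores, replacing a fixed cutoff such as 0.5 with the selected value.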
Abstract
We present our contribution to the 8th ABAW challenge at CVPR 2025, where we tackle valence-arousal estimation, emotion recognition, and facial action unit detection as three independent challenges. Our approach leverages the well-known Dual-Direction Attention Mixed Feature Network (DDAMFN) for all three tasks, achieving results that surpass the proposed baselines. As an additional experiment, we explore the use of CLIP for the emotion recognition challenge. We provide insights into the architectural choices that contribute to the strong performance of our methods.
