Enhancing Facial Expression Recognition through Dual-Direction Attention Mixed Feature Networks: Application to 7th ABAW Challenge

Josep Cabacas-Maso, Elena Ortega-Beltrán, Ismael Benito-Altamirano, Carles Ventura

TL;DR

This work adapts the Dual-Direction Attention Mixed Feature Network (DDAMFN) to a multitask facial expression analysis setting for the 7th ABAW challenge, predicting valence-arousal, basic expressions, and action units (AUs) from the s-AffWild2 subset. Using MobileFaceNet as the backbone, it adds a Dual-Direction Attention module and a Global Depthwise Convolution layer, followed by three task-specific heads. Two training schemes are compared, end-to-end fine-tuning and task-specific classifiers, and performance is evaluated with the Concordance Correlation Coefficient (CCC) and F1 metrics, with particular attention to the impact of threshold optimization on action units. Results show that while multitask fine-tuning can approach single-task performance, task-specific optimization and threshold post-processing significantly affect the AU score and the overall P score. The findings underscore the potential of DDAMFN as a competitive multitask feature extractor and emphasize the importance of losses, data balance, and thresholds in ABAW-style evaluations.
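To make the two metrics and the threshold step concrete, here is a minimal pure-Python sketch. It is not the authors' evaluation code: the function names (`ccc`, `f1_binary`, `best_threshold`), the threshold grid, and the toy scores are all assumptions made for illustration. CCC is computed with population variances, and the threshold for a single action unit is chosen by a simple grid search that maximizes F1 on held-out scores.

```python
from statistics import fmean


def ccc(pred, gold):
    """Concordance correlation coefficient of two equal-length sequences."""
    mp, mg = fmean(pred), fmean(gold)
    var_p = fmean([(p - mp) ** 2 for p in pred])  # population variance
    var_g = fmean([(g - mg) ** 2 for g in gold])
    cov = fmean([(p - mp) * (g - mg) for p, g in zip(pred, gold)])
    denom = var_p + var_g + (mp - mg) ** 2
    return 2 * cov / denom if denom else 0.0


def f1_binary(pred, gold):
    """F1 score for binary predictions against binary labels."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, gold, grid=None):
    """Pick the threshold for one AU that maximizes F1 (hypothetical grid)."""
    grid = grid or [i / 20 for i in range(1, 20)]
    # Start from the default 0.5 cut-off and keep the first strict improvement.
    best_t = 0.5
    best_f1 = f1_binary([s >= best_t for s in scores], gold)
    for t in grid:
        f1 = f1_binary([s >= t for s in scores], gold)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy validation scores for a single AU (made-up numbers).
print(best_threshold([0.1, 0.3, 0.35, 0.8], [0, 1, 1, 1]))  # -> (0.15, 1.0)
```

In the toy example the default 0.5 cut-off misses two of the three positives (F1 = 0.5), while the searched threshold recovers all of them, which is the kind of effect the threshold analysis refers to. In practice the search would be run separately for each AU on validation data.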

Abstract

We present our contribution to the 7th ABAW challenge at ECCV 2024. By utilizing a Dual-Direction Attention Mixed Feature Network (DDAMFN) for multitask facial expression recognition, we achieve results far beyond the proposed baseline for the Multi-Task ABAW challenge. Our proposal uses the well-known DDAMFN architecture as a base to effectively predict valence-arousal, recognize emotions, and detect facial action units. We demonstrate the architecture's ability to handle these tasks simultaneously, providing insights into its structure and the rationale behind its design. Additionally, we compare our multitask results with independent single-task performance.

Paper Structure (8 sections, 1 equation, 1 figure, 6 tables)

This paper contains 8 sections, 1 equation, 1 figure, 6 tables.

Figures (1)

  • Figure 1: Our DDAMFN (Zhang et al., 2023) architecture for the 7th ABAW challenge: MobileFaceNet (MFN) for feature extraction (grey), Dual-Direction Attention (DDA) module (green), Global Depthwise Convolution (GDConv) layer (red), and three fully-connected layers for valence-arousal prediction, emotion recognition, and action unit detection (yellow).
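
The final stage in Figure 1, three fully-connected layers on top of a shared feature vector, can be sketched as follows. This is a dependency-free illustration only, not the authors' implementation. The feature size (`FEAT_DIM`), the class counts (`N_EXPR`, `N_AU`), the random weights, and the output activations (tanh for valence-arousal, softmax for expressions, sigmoid for action units) are all assumptions; the MFN, DDA and GDConv stages are replaced by a random feature vector.

```python
import math
import random

FEAT_DIM = 16  # hypothetical size of the shared feature vector
N_EXPR = 8     # assumed number of expression classes
N_AU = 12      # assumed number of action units


def make_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully-connected head."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply one fully-connected layer to the feature vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def multitask_heads(features, layers):
    """Map one shared feature vector to the three task outputs."""
    # Valence and arousal, squashed to [-1, 1] (assumed activation).
    va = [math.tanh(z) for z in linear(features, layers["va"])]
    # Expression probabilities via a numerically stable softmax.
    logits = linear(features, layers["expr"])
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    expr = [e / total for e in exps]
    # Independent per-AU activation probabilities via sigmoid.
    au = [1.0 / (1.0 + math.exp(-z)) for z in linear(features, layers["au"])]
    return va, expr, au


rng = random.Random(0)
layers = {
    "va": make_layer(FEAT_DIM, 2, rng),
    "expr": make_layer(FEAT_DIM, N_EXPR, rng),
    "au": make_layer(FEAT_DIM, N_AU, rng),
}
features = [rng.uniform(-1.0, 1.0) for _ in range(FEAT_DIM)]  # stand-in for GDConv output
va, expr, au = multitask_heads(features, layers)
```

The point of the sketch is that the three heads share one feature vector but use different output spaces: two continuous values, one categorical distribution, and a set of independent binary probabilities that are later thresholded.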