Emotic Masked Autoencoder with Attention Fusion for Facial Expression Recognition

Bach Nguyen-Xuan, Thien Nguyen-Hoang, Thanh-Huy Nguyen, Nhu Tai-Do

TL;DR

This work tackles facial expression recognition under data-scarce conditions by combining self-supervised masked-autoencoder pre-training (MAE-Face) with a multi-view Fusion Attention scheme. It introduces a two-stage pipeline: pre-train a MAE-Face backbone on large unlabeled facial datasets, then fine-tune and fuse multi-view features (including eye, mouth, and nose patches) using self- and local-attention with skip connections. Key contributions include a robust data-synthesis framework, a detailed exploration of fusion strategies (Mean, Concat, and their UpDown variants), and evidence that eye-mouth cues with Concat fusion yield strong performance on Aff-Wild2, especially when predictions are smoothed with a sliding window. The approach achieves competitive results on the ABAW6 benchmark with reduced training data and offers practical insights for leveraging localized facial regions in dynamic FER tasks, with potential for broader application in affective computing.
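
The sliding-window smoothing mentioned above can be illustrated with a minimal sketch. The summary does not state the window size or whether smoothing is applied to labels or to class probabilities, so the majority vote over per-frame labels, the function name `smooth_labels`, and the default `window=5` are all assumptions made for illustration.

```python
from collections import Counter


def smooth_labels(labels, window=5):
    """Majority-vote smoothing of per-frame labels (hypothetical helper).

    Uses a centered window that is clipped at the sequence boundaries.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo = max(0, i - half)
        hi = min(len(labels), i + half + 1)
        neighborhood = labels[lo:hi]
        counts = Counter(neighborhood)
        top = max(counts.values())
        # Labels that reach the highest count, in window order.
        winners = [lab for lab in neighborhood if counts[lab] == top]
        # On ties, keep the frame's own label if it is among the winners.
        smoothed.append(labels[i] if labels[i] in winners else winners[0])
    return smoothed


# Isolated single-frame flips (labels 3 and 4) are voted away.
frame_preds = [0, 0, 3, 0, 0, 1, 1, 1, 4, 1]
print(smooth_labels(frame_preds, window=3))
```

A vote over labels is only one option; averaging per-frame class probabilities before taking the arg-max would be an equally plausible reading of the text.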

Abstract

Facial Expression Recognition (FER) is a critical task within computer vision with diverse applications across various domains. Limited FER datasets hamper the generalization capability of expression recognition models, so addressing this limitation is imperative for enhancing performance. Our paper presents an approach that integrates the MAE-Face self-supervised learning (SSL) method with a multi-view Fusion Attention mechanism for expression classification, showcased in the 6th Affective Behavior Analysis in-the-wild (ABAW) competition. By utilizing low-level feature information from the ipsilateral view (auxiliary view) before learning the high-level features that emphasize shifts in human facial expression, our work seeks to provide a straightforward way to improve the examined view (main view). We also suggest easy-to-implement, training-free frameworks that highlight key facial features, to determine whether such features can guide the model toward pivotal local elements. The efficacy of this method is validated by improvements in model performance on the Aff-Wild2 dataset, observed in both training and validation contexts.
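
One way to picture a framework that highlights key facial features without any training is a simple region mask. The sketch below is an assumption, not the authors' procedure: the abstract does not say how regions are located, so the helper `keep_regions` and the box coordinates are hypothetical, and in practice the boxes would come from a facial landmark detector.

```python
def keep_regions(image, boxes, fill=0):
    """Return a copy of `image` with pixels outside all boxes set to `fill`.

    `image` is a list of rows; each box is (top, left, bottom, right)
    with bottom and right exclusive. Hypothetical helper for illustration.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [[fill] * width for _ in range(height)]
    for top, left, bottom, right in boxes:
        # Clip each box to the image bounds before copying pixels.
        for r in range(max(0, top), min(height, bottom)):
            for c in range(max(0, left), min(width, right)):
                out[r][c] = image[r][c]
    return out


# Toy 6x6 "face" whose pixel value encodes its position (row * 10 + col).
face = [[r * 10 + c for c in range(6)] for r in range(6)]
eye_box = (1, 0, 2, 6)    # hypothetical eye strip: row 1, full width
mouth_box = (4, 2, 5, 4)  # hypothetical mouth patch: row 4, columns 2-3
masked = keep_regions(face, [eye_box, mouth_box])
for row in masked:
    print(row)
```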

Paper Structure

This paper contains 22 sections, 5 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Our proposed pipeline: two-stage pre-training and fine-tuning with fusion, together with a synthesis framework that extracts informative facial features using uni-task expression annotations.
  • Figure 2: The architecture of our proposed model consists of two stages. First, the pre-trained MAE model is fine-tuned on two datasets: the original dataset and the extracted-feature dataset. Second, the feature spaces of the two fine-tuned models are used to train the attention fusion model with one of four methods (concat, mean, updown-concat, updown-mean) in the key generator module.
  • Figure 3: After the images in the Aff-Wild2 dataset are cropped and aligned, the image regions that contain only the mouth and eyes are extracted for further processing.
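
The difference between the mean and concat fusion methods named in Figure 2 can be sketched on plain feature vectors. This is a minimal illustration under assumptions: it shows only how two views are combined, not the attention fusion model or the key generator module, and it leaves out the updown variants because the captions do not define them. The function names `fuse_mean` and `fuse_concat` are hypothetical.

```python
def fuse_mean(main_feat, aux_feat):
    """Element-wise average of two equal-length feature vectors."""
    if len(main_feat) != len(aux_feat):
        raise ValueError("mean fusion needs features of equal length")
    return [(m + a) / 2.0 for m, a in zip(main_feat, aux_feat)]


def fuse_concat(main_feat, aux_feat):
    """Concatenation: keeps both views intact and doubles the dimension."""
    return list(main_feat) + list(aux_feat)


# Toy features from the main view (full face) and the auxiliary view
# (eye and mouth regions); the values are made up for illustration.
main_view = [1.0, 2.0]
aux_view = [3.0, 6.0]
print(fuse_mean(main_view, aux_view))    # [2.0, 4.0]
print(fuse_concat(main_view, aux_view))  # [1.0, 2.0, 3.0, 6.0]
```

Mean fusion keeps the feature dimension fixed but blends the two views, while concat fusion preserves each view separately at the cost of a wider input to whatever layer follows.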