Solution for 8th Competition on Affective & Behavior Analysis in-the-wild
Jun Yu, Yunxiang Zhang, Xilong Lu, Yang Zheng, Yongqi Wang, Lingsi Zhu
TL;DR
Detecting facial Action Units (AUs) in-the-wild is challenging due to multimodal variability, lighting, and noise. The authors propose an audio-visual framework that uses Whisper for audio features and ConvNeXt for visual features, with global and local views added to capture both context and fine-grained cues. A cross-modal fusion module with multi-scale self-attention and adaptive sliding windows synchronizes the two modalities, and is followed by a Temporal Convolutional Network and a two-layer MLP that predicts AUs on the Aff-Wild2 dataset. Under six-fold cross-validation, the method reaches F1 scores in the mid-50s (percent), which the authors attribute to the global-local representations and the cross-modal fusion.
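To make the reported metric concrete, here is a minimal sketch of how an F1 score for multi-label AU detection can be computed. It assumes that the score is the unweighted average of per-AU binary F1 values; the report summary does not state the averaging scheme, so that choice, along with the function names `f1_per_au` and `macro_f1`, is illustrative only.

```python
def f1_per_au(y_true, y_pred):
    """Binary F1 for a single AU, given two equal-length lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0 when the AU is never present or predicted.
    return 2 * tp / denom if denom else 0.0


def macro_f1(true_frames, pred_frames):
    """Average the per-AU F1 over all AUs (assumed averaging scheme).

    Each argument is a list of frames; each frame is a list of 0/1 labels,
    one per AU.
    """
    num_aus = len(true_frames[0])
    scores = [
        f1_per_au([f[k] for f in true_frames], [f[k] for f in pred_frames])
        for k in range(num_aus)
    ]
    return sum(scores) / num_aus


# Toy example: 4 frames, 2 AUs (made-up labels, not Aff-Wild2 data).
true_frames = [[1, 0], [1, 1], [0, 1], [0, 0]]
pred_frames = [[1, 0], [0, 1], [0, 1], [1, 0]]
score = macro_f1(true_frames, pred_frames)
print(score)  # 0.75: AU 0 scores 0.5, AU 1 scores 1.0
```

Under this reading, a score in the mid-50s means the per-AU F1 values average to roughly 0.55, so individual AUs may sit well above or below that figure.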
Abstract
In this report, we present our solution for the Action Unit (AU) Detection Challenge in the 8th Competition on Affective Behavior Analysis in-the-wild. To achieve robust and accurate classification of facial action units in the wild, we introduce a method that leverages audio-visual multimodal data. Our method employs ConvNeXt as the image encoder and uses Whisper to extract Mel spectrogram features. We then apply a Transformer encoder-based feature fusion module to integrate the affective information embedded in the audio and image features. This provides rich high-dimensional feature representations for the subsequent multilayer perceptron (MLP) trained on the Aff-Wild2 dataset, improving the accuracy of AU detection.
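The pipeline described in the abstract (per-frame image and audio features, Transformer encoder fusion, MLP head) can be sketched roughly as follows in PyTorch. This is not the authors' implementation: the class name `AudioVisualFusion`, all feature dimensions, the layer counts, the concatenation-based fusion, and the choice of 12 output AUs are assumptions made for illustration. It also assumes the ConvNeXt and Whisper features have been extracted beforehand and aligned to the same number of frames.

```python
import torch
import torch.nn as nn


class AudioVisualFusion(nn.Module):
    """Hypothetical fusion head: project, concatenate, Transformer-encode, classify."""

    def __init__(self, visual_dim=768, audio_dim=512, d_model=256, num_aus=12):
        super().__init__()
        # Project each modality to half of the model width (assumed design).
        self.visual_proj = nn.Linear(visual_dim, d_model // 2)
        self.audio_proj = nn.Linear(audio_dim, d_model // 2)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=512, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Two-layer MLP producing one logit per AU for every frame.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, num_aus)
        )

    def forward(self, visual, audio):
        # visual: (batch, frames, visual_dim); audio: (batch, frames, audio_dim)
        fused = torch.cat(
            [self.visual_proj(visual), self.audio_proj(audio)], dim=-1
        )
        fused = self.encoder(fused)  # self-attention over the frame axis
        return self.mlp(fused)  # logits: (batch, frames, num_aus)


# Random tensors stand in for precomputed ConvNeXt and Whisper features.
model = AudioVisualFusion().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 16, 768), torch.randn(2, 16, 512))
probs = torch.sigmoid(logits)  # independent per-AU probabilities
print(probs.shape)  # torch.Size([2, 16, 12])
```

Because several AUs can be active in the same frame, the sketch applies a sigmoid to each logit rather than a softmax across AUs; training such a head would typically use a binary cross-entropy loss per AU.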
