HSEmotion Team at ABAW-8 Competition: Audiovisual Ambivalence/Hesitancy, Emotional Mimicry Intensity and Facial Expression Recognition
Andrey V. Savchenko
TL;DR
This work presents a lightweight multimodal framework for in-the-wild affective behavior analysis that addresses three ABAW-8 tasks: facial expression recognition (FER), emotional mimicry intensity (EMI) estimation, and ambivalence/hesitancy (AH) recognition. It fuses visual embeddings from EmotiEffLib with acoustic features (wav2vec 2.0, HuBERT) and text embeddings of speech transcripts, and applies frame-level filtering and temporal smoothing, significantly improving on the baselines in all three tasks. The paper shows that filtering frames by the confidence of a pre-trained FER model, combined with simple late fusion, yields robust performance without extensive fine-tuning. The reported results include an F1-score of up to 44.59% for FER with multimodal fusion, a Pearson correlation coefficient (PCC) of about 0.446 for EMI with text and audio embeddings, and an F1-score of about 73.7% for AH recognition, highlighting the practical value of multimodal, computationally efficient affective analysis for real-world applications.
Abstract
This article presents our results for the eighth Affective Behavior Analysis in-the-Wild (ABAW) competition. We combine facial emotional descriptors extracted by pre-trained models from our EmotiEffLib library with acoustic features and embeddings of text transcribed from speech. The frame-level features are aggregated and fed into simple classifiers, e.g., a multi-layer perceptron (a feed-forward neural network with one hidden layer), to predict ambivalence/hesitancy and facial expressions. For facial expressions, we also use the pre-trained facial expression recognition model to select the video frames it scores with high confidence, so that these frames do not need to be processed by a domain-specific video classifier. The video-level prediction of emotional mimicry intensity is implemented by simply aggregating frame-level features and training a multi-layer perceptron. Experimental results for three tasks from the ABAW challenge demonstrate that our approach significantly improves validation metrics compared to existing baselines.
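The two ideas in the abstract, aggregating frame-level features into one video-level descriptor and skipping the domain-specific classifier on frames the pre-trained model already scores confidently, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the mean/standard-deviation aggregation, the 0.8 threshold, and the names `aggregate_frames`, `predict_frames`, and `fallback_predict` are all assumptions made for illustration.

```python
import numpy as np


def aggregate_frames(frame_features):
    """Collapse a (num_frames, dim) array into one video-level descriptor.

    Assumption: mean and standard deviation are concatenated; the paper only
    says that frame-level features are aggregated.
    """
    feats = np.asarray(frame_features, dtype=np.float64)
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])


def predict_frames(frame_probs, frame_features, fallback_predict, threshold=0.8):
    """Keep the pre-trained model's label on confident frames.

    frame_probs: (num_frames, num_classes) scores from a pre-trained FER model.
    fallback_predict: hypothetical callable standing in for the
        domain-specific classifier; it maps an array of features to labels.
    threshold: assumed confidence cut-off, not a value from the paper.
    """
    probs = np.asarray(frame_probs)
    feats = np.asarray(frame_features)
    labels = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    # Only low-confidence frames are sent to the domain-specific classifier.
    if (~confident).any():
        labels[~confident] = fallback_predict(feats[~confident])
    return labels, confident
```

In such a setup, the output of `aggregate_frames` would be the input to a video-level multi-layer perceptron, while `predict_frames` would run per frame before any temporal smoothing.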
