HSEmotion Team at ABAW-8 Competition: Audiovisual Ambivalence/Hesitancy, Emotional Mimicry Intensity and Facial Expression Recognition

Andrey V. Savchenko

TL;DR

This work presents a lightweight multimodal framework for affective behavior analysis in the wild, addressing three ABAW-8 tasks: facial expression recognition (FER), emotional mimicry intensity (EMI) estimation, and ambivalence/hesitancy (AH) recognition. The approach fuses visual embeddings from EmotiEffLib with acoustic features (wav2vec 2.0, HuBERT) and text embeddings of speech transcripts, then applies frame-level filtering and temporal smoothing. It improves validation metrics over the baselines on all three tasks. The paper shows that filtering frames by the confidence of a pre-trained FER model, combined with simple late fusion, yields robust performance without extensive fine-tuning. Reported results include FER at up to 44.59% F1 with multimodal fusion, EMI at a PCC of up to ~0.446 with text and audio embeddings, and AH recognition at an F1 of ~73.7%, which highlights the practical value of computationally efficient multimodal affective analysis for real-world applications.

Abstract

This article presents our results for the eighth Affective Behavior Analysis in-the-Wild (ABAW) competition. We combine facial emotional descriptors extracted by pre-trained models, namely, our EmotiEffLib library, with acoustic features and embeddings of texts recognized from speech. The frame-level features are aggregated and fed into simple classifiers, e.g., a multi-layer perceptron (a feed-forward neural network with one hidden layer), to predict ambivalence/hesitancy and facial expressions. In the latter case, we also use the pre-trained facial expression recognition model to select high-score video frames and skip their processing by a domain-specific video classifier. The video-level prediction of emotional mimicry intensity is implemented by simply aggregating frame-level features and training a multi-layer perceptron. Experimental results for three tasks from the ABAW challenge demonstrate that our approach significantly increases validation metrics compared to existing baselines.
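The aggregation of frame-level features into a video-level descriptor can be illustrated with a minimal sketch. The abstract does not say which pooling functions are used, so the component-wise mean and standard deviation below are an assumption, and the function name `aggregate_frames` is hypothetical. The resulting vector is the kind of input a small multi-layer perceptron could be trained on.

```python
from statistics import fmean, pstdev


def aggregate_frames(frame_features):
    """Pool per-frame feature vectors into one video-level descriptor.

    Assumption (not stated in the source): the descriptor is the
    component-wise mean concatenated with the component-wise
    population standard deviation.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    # Transpose so that each tuple holds one feature dimension across frames.
    columns = list(zip(*frame_features))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means + stds


# Three frames with two-dimensional (toy) embeddings.
video = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
descriptor = aggregate_frames(video)
```

For D-dimensional frame embeddings this yields a 2D-dimensional descriptor regardless of the number of frames, which is what allows a fixed-size classifier to handle videos of any length.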

Paper Structure

This paper contains 13 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Proposed approach.
  • Figure 2: Dependence of F1-score for video Expr recognition on the smoothing kernel size $k$.
  • Figure 3: Dependence of F1-score for video Expr recognition on the filtering threshold $t$.
  • Figure 4: Dependence of F1-score for audio-visual Expr recognition on the weight $w$.
  • Figure 5: Dependence of F1-score for audio-visual Expr recognition on the filtering threshold $t$.
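The three hyperparameters named in the figure captions (smoothing kernel size $k$, filtering threshold $t$, and fusion weight $w$) can be made concrete with a small sketch. It reflects one plausible reading of the abstract, not the paper's confirmed implementation: a box filter is assumed for smoothing, $w$ is assumed to weight the visual scores against the audio scores, and a frame whose pre-trained FER confidence reaches $t$ is assumed to keep the pre-trained decision. All function names are hypothetical.

```python
def smooth(scores, k):
    """Box-filter per-frame class scores with a window of size k.

    Assumption: a simple moving average; the window shrinks at the
    sequence boundaries instead of padding.
    """
    half = k // 2
    out = []
    for i in range(len(scores)):
        window = scores[max(0, i - half): i + half + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def fuse(visual, audio, w):
    """Late fusion: weighted sum of visual and audio class scores."""
    return [w * v + (1 - w) * a for v, a in zip(visual, audio)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def predict_frame(pretrained, domain, audio, t, w):
    """Pick a class index for one frame.

    If the pre-trained FER model is confident (top score >= t), its
    decision is kept and the domain-specific classifier is bypassed.
    Otherwise the domain-specific visual scores are fused with audio.
    """
    if max(pretrained) >= t:
        return argmax(pretrained)
    return argmax(fuse(domain, audio, w))


confident = predict_frame([0.9, 0.05, 0.05], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], t=0.8, w=0.5)
fused_even = predict_frame([0.4, 0.3, 0.3], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], t=0.8, w=0.5)
fused_visual = predict_frame([0.4, 0.3, 0.3], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], t=0.8, w=0.9)
smoothed = smooth([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], k=3)
```

In this toy example the confident frame keeps the pre-trained class, while the uncertain frame switches between the audio-preferred and the visually preferred class as $w$ grows, which is the kind of dependence the figures plot against F1-score.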