Efficient Feature Extraction and Late Fusion Strategy for Audiovisual Emotional Mimicry Intensity Estimation
Jun Yu, Wangyuan Zhu, Jichao Zhu
TL;DR
This work tackles the Emotional Mimicry Intensity (EMI) estimation problem in-the-wild by proposing an efficient audiovisual feature extraction pipeline that combines dual-channel visual features from ResNet18 and facial Action Units (AUs) with audio features from Wav2Vec2.0. A Temporal Convolutional Network (TCN) and a Transformer encoder capture short- and long-range temporal dependencies, and a late fusion scheme averages the unimodal predictions to produce the final EMI estimate. The approach yields a mean Pearson correlation of 0.3288 across six emotion dimensions on the validation set, outperforming baseline EMI methods and demonstrating the benefit of multimodal fusion with discriminative visual features and powerful audio representations. The solution offers a practical, scalable method for EMI estimation in affective computing and ABAW-style benchmarks, with potential applicability to real-time empathy-aware systems.
Abstract
In this paper, we present our solution to the Emotional Mimicry Intensity (EMI) Estimation challenge, which is part of the 6th Affective Behavior Analysis in-the-wild (ABAW) Competition. The EMI Estimation challenge aims to evaluate the emotional intensity of seed videos by assessing them along a set of six predefined emotion categories (i.e., "Admiration", "Amusement", "Determination", "Empathic Pain", "Excitement" and "Joy"). To tackle this challenge, we extracted rich dual-channel visual features based on ResNet18 and AUs for the video modality and effective single-channel features based on Wav2Vec2.0 for the audio modality. This allowed us to obtain comprehensive emotional features for the audiovisual modality. Additionally, leveraging a late fusion strategy, we averaged the predictions of the visual and acoustic models, resulting in a more accurate estimation of audiovisual emotional mimicry intensity. Experimental results validate the effectiveness of our approach: the average Pearson's correlation coefficient ($ρ$) across the 6 emotion dimensions reaches 0.3288 on the validation set.
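As an illustration of the late fusion and evaluation steps described above, the following is a minimal sketch and not the authors' implementation. It assumes each unimodal model outputs an array of shape (n_samples, 6), one column per emotion. The function names `late_fusion` and `mean_pearson` are hypothetical and chosen only for this example. The equal-weight average follows the abstract; everything else is an assumption.

```python
import numpy as np

# Emotion order is an assumption for this sketch; the names come from the abstract.
EMOTIONS = ["Admiration", "Amusement", "Determination",
            "Empathic Pain", "Excitement", "Joy"]


def late_fusion(visual_pred, audio_pred):
    """Average the visual and audio predictions element-wise.

    Both inputs are assumed to have shape (n_samples, 6).
    """
    visual_pred = np.asarray(visual_pred, dtype=float)
    audio_pred = np.asarray(audio_pred, dtype=float)
    if visual_pred.shape != audio_pred.shape:
        raise ValueError("unimodal predictions must have the same shape")
    return (visual_pred + audio_pred) / 2.0


def mean_pearson(pred, target):
    """Return (mean rho, per-emotion rho) over the columns of pred and target.

    A column with zero variance yields NaN, since rho is undefined there.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    pred_c = pred - pred.mean(axis=0)        # center each emotion column
    target_c = target - target.mean(axis=0)
    num = (pred_c * target_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0))
    rho = num / den
    return float(rho.mean()), rho
```

Given validation labels `y` and unimodal outputs `v` and `a`, calling `mean_pearson(late_fusion(v, a), y)` would give the kind of averaged score reported above, under the stated shape assumptions.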
