Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction

Tobias Hallmen; Fabian Deuser; Norbert Oswald; Elisabeth André

TL;DR

The paper tackles the prediction of Emotional Mimicry Intensity (EMI) in the wild, where facial cues can be occluded, by proposing an audio-only, unimodal multi-task fusion framework. It uses a Wav2Vec 2.0 backbone fine-tuned for Valence-Arousal-Dominance (VAD) regression and combines its features with a global context vector and an LSTM to predict six EMI scores, trained end-to-end with a mean-squared-error objective. Ablation studies show that the full configuration (Global Vector, Regression Head, LSTM, and VAD Head) performs best, reaching a peak $\rho_{VAL}$ of about 0.386, while generic ASR pretraining provides limited gains and vision modalities can even hinder EMI prediction. The approach demonstrates that a carefully designed audio-centric model can surpass the baseline and place highly in the EMI challenge, offering an efficient route to affective analysis in real-world and therapeutic contexts. Future work may explore fusion with text and more sophisticated multimodal alignment to further improve EMI estimation without sacrificing efficiency.

Abstract

In this research, we introduce a novel methodology for assessing Emotional Mimicry Intensity (EMI) as part of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild. Our methodology utilises the Wav2Vec 2.0 architecture, pre-trained on an extensive podcast dataset, to capture a wide array of audio features that include both linguistic and paralinguistic components. We refine our feature extraction process with a fusion technique that combines individual features with a global mean vector, thereby embedding broader context into our analysis. A key aspect of our approach is the multi-task fusion strategy, which not only leverages these features but also incorporates a pre-trained Valence-Arousal-Dominance (VAD) model. This integration is designed to refine emotion intensity prediction by concurrently processing multiple emotional dimensions. For the temporal analysis of the audio data, our feature fusion process utilises a Long Short-Term Memory (LSTM) network. This approach, which relies solely on the provided audio data, shows marked improvement over the existing baseline and achieved second place in the EMI challenge, offering a more comprehensive understanding of emotional mimicry in naturalistic settings.

Paper Structure (14 sections, 3 equations, 4 figures, 2 tables)

Introduction
Related work
Dataset and Challenge
Challenge
Descriptive Data Analysis
Methodology
End-to-end Description
Implementation Details
Evaluation
Ablation
Architectural Choices
Exploratory Findings
Discussion
Conclusion

Figures (4)

Figure 1: Boxplots of the label distribution: train (upper plot) and validation (lower plot).
Figure 2: Distribution of video duration in seconds (top) and fps (bottom) over all data splits, in log-log scale.
Figure 3: Examples of particularly challenging videos: overall quality, e.g. illumination (left, manual crop), affects downstream tasks, e.g. face detection (right).
Figure 4: Architecture overview of our approach. We use a pre-trained Wav2Vec 2.0 model [wagner_2022_6221127] with a Valence-Arousal-Dominance (VAD) module and extract the features as well as the VAD predictions. To leverage global context we use a global vector and fuse the temporal features in an LSTM.
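
The fusion step named in the Figure 4 caption and the abstract (individual features combined with a global mean vector, plus VAD predictions, before the LSTM) might look roughly like the sketch below. It is illustrative only and not the authors' implementation: the function names, the toy 2-dimensional features, and the choice to append the VAD triple to every frame are assumptions, and the LSTM itself is omitted. The `mse` helper mirrors the mean-squared-error objective mentioned in the TL;DR.

```python
from statistics import fmean


def fuse_with_global_mean(frames):
    """Concatenate each frame-level feature vector with the mean vector over time."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    # Global context: element-wise mean over all time steps.
    global_vec = [fmean(f[d] for f in frames) for d in range(dim)]
    return [list(f) + global_vec for f in frames]


def append_vad(fused, vad):
    """Append a (valence, arousal, dominance) triple to every fused frame.

    Assumption: how the VAD predictions enter the fusion is not specified
    here, so per-frame concatenation is only one plausible option.
    """
    return [row + list(vad) for row in fused]


def mse(pred, target):
    """Mean-squared error between two equally long score vectors."""
    return fmean((p - t) ** 2 for p, t in zip(pred, target))


# Toy example: two time steps with 2-dimensional features (hypothetical values).
frames = [[1.0, 2.0], [3.0, 4.0]]
fused = fuse_with_global_mean(frames)
lstm_input = append_vad(fused, (0.5, 0.4, 0.6))
```

In a real pipeline the rows of `lstm_input` would be fed to a recurrent layer whose output is regressed onto the six EMI scores. The pure-Python lists here only make the concatenation order explicit.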
