MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network

Vrushank Ahire; Kunal Shah; Mudasir Nazir Khan; Nikhil Pakhale; Lownish Rai Sookha; M. A. Ganaie; Abhinav Dhall

TL;DR

This work addresses continuous emotion estimation in the wild with a multi-modal attention framework that fuses visual, audio, and textual cues. MAVEN applies bidirectional cross-modal attention across all modality pairs, then refines each modality with a BEiT-based intra-modal stage, and finally predicts valence and arousal (VA) in polar coordinates, so that information is exchanged both across time and across modalities. On Aff-Wild2, MAVEN achieves a state-of-the-art $CCC_{avg} = 0.3061$, exceeding the ResNet-50 baseline ($CCC = 0.22$) and demonstrating the value of integrated cross-modal modeling with a circumplex VA representation. The approach is relevant to real-world affective computing because it improves recognition in unconstrained settings and aligns predictions with psychological emotion models by expressing the two dimensions in polar form: $\text{valence} = I \cos(\theta)$ and $\text{arousal} = I \sin(\theta)$, where $I$ is the emotional intensity and $\theta$ is the angle on the circumplex.
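The polar parameterization above can be illustrated with a minimal sketch. It is not taken from the MAVEN code base: the function names `polar_to_va` and `va_to_polar` are hypothetical, and the sketch assumes $\theta$ is in radians and $I \geq 0$.

```python
import math


def polar_to_va(intensity, theta):
    """Map polar coordinates (intensity I, angle theta in radians)
    to Cartesian (valence, arousal), following
    valence = I * cos(theta) and arousal = I * sin(theta)."""
    valence = intensity * math.cos(theta)
    arousal = intensity * math.sin(theta)
    return valence, arousal


def va_to_polar(valence, arousal):
    """Inverse mapping: recover (intensity, theta) from (valence, arousal).
    theta is returned in the range (-pi, pi] by math.atan2."""
    intensity = math.hypot(valence, arousal)
    theta = math.atan2(arousal, valence)
    return intensity, theta


if __name__ == "__main__":
    # Illustrative values only: moderate intensity at 45 degrees gives
    # equal positive valence and arousal.
    v, a = polar_to_va(0.5, math.pi / 4)
    print(v, a)
    print(va_to_polar(v, a))
```

One consequence of this parameterization is that a single intensity value scales valence and arousal together, which couples the two dimensions instead of treating them as independent regression targets.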

Abstract

Dynamic emotion recognition in the wild remains challenging due to the transient nature of emotional expressions and the temporal misalignment of multi-modal cues. Traditional approaches predict valence and arousal but often overlook the inherent correlation between these two dimensions. The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities through a bidirectional cross-modal attention mechanism. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts, and predicts emotions in polar coordinates following Russell's circumplex model. Evaluated on the Aff-Wild2 dataset, MAVEN achieves a concordance correlation coefficient (CCC) of 0.3061, surpassing the ResNet-50 baseline, which reaches a CCC of 0.22. The multistage architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition in real-world situations. The code is available at: https://github.com/Vrushank-Ahire/MAVEN_8th_ABAW
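For readers unfamiliar with the metric, the sketch below computes the concordance correlation coefficient using its standard definition, $CCC = \frac{2\sigma_{xy}}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$. It is an illustration, not the authors' evaluation script. The helper names `ccc` and `ccc_avg` are hypothetical, and treating $CCC_{avg}$ as the mean of the valence and arousal CCCs is an assumption based on common practice for Aff-Wild2 evaluation.

```python
def ccc(pred, target):
    """Concordance correlation coefficient between two equal-length
    sequences, using population (1/n) variance and covariance."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal in length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    # CCC is undefined when both sequences are constant and equal;
    # returning 0.0 in that case is a convention chosen for this sketch.
    return 2.0 * cov / denom if denom else 0.0


def ccc_avg(pred_v, true_v, pred_a, true_a):
    """Assumed definition: mean of the valence CCC and the arousal CCC."""
    return 0.5 * (ccc(pred_v, true_v) + ccc(pred_a, true_a))


if __name__ == "__main__":
    # Toy values for illustration only, not Aff-Wild2 data.
    print(ccc([0.1, 0.4, 0.3], [0.2, 0.5, 0.1]))
```

Unlike Pearson correlation, CCC also penalizes differences in mean and scale between predictions and labels, so a prediction that is perfectly correlated with the labels but shifted by a constant scores below 1.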
