Mamba-VA: A Mamba-based Approach for Continuous Emotion Recognition in Valence-Arousal Space
Yuheng Liang, Zheyu Wang, Feng Liu, Mingzhou Liu, Yu Yao
TL;DR
Mamba-VA targets continuous emotion recognition in the Valence-Arousal (VA) space, where existing methods struggle with long-term temporal dependencies. It uses a hybrid architecture: a Masked Autoencoder extracts visual features from video frames, a Temporal Convolutional Network models local temporal dynamics, a Mamba-based encoder captures long-range dependencies, and a fully connected layer regresses continuous VA values. On the ABAW VA Estimation task, which is built on the Aff-Wild2 dataset, the model outperforms the baseline, scoring 0.5362 (valence) and 0.4310 (arousal) on the validation set and 0.5036 and 0.4119 on the test set. The results suggest that combining efficient state-space sequence modeling with strong visual representations is a practical option for HCI and affective computing applications.
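To make the division of labor between the two temporal stages concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a single scalar feature per frame and fixed parameters. The function names causal_dilated_conv and linear_ssm_scan are hypothetical. The first function is the building block of a TCN; the second is the kind of linear state-space recurrence that Mamba builds on. Mamba itself makes the recurrence parameters input-dependent and uses a hardware-aware scan, neither of which is shown here.

```python
from typing import List


def causal_dilated_conv(x: List[float], kernel: List[float], dilation: int = 1) -> List[float]:
    """Causal 1-D convolution: out[t] uses only x[t], x[t-d], x[t-2d], ..."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # positions before the sequence start count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def linear_ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Scalar state-space recurrence: h[t] = a*h[t-1] + b*x[t], y[t] = c*h[t]."""
    h = 0.0
    ys = []
    for xt in x:
        h = a * h + b * xt  # the state carries a decaying summary of the whole past
        ys.append(c * h)
    return ys


# Toy per-frame feature: a single impulse at frame 0 (illustrative values only).
features = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Local stage: the impulse reaches only frames within the kernel's receptive field.
local = causal_dilated_conv(features, kernel=[0.5, 0.5], dilation=2)

# Long-range stage: the recurrent state keeps the impulse's influence alive
# for every later frame, decaying by a factor of `a` per step.
global_ = linear_ssm_scan(local, a=0.9, b=1.0, c=1.0)
```

In this toy run the convolution output is nonzero only at frames 0 and 2, while the recurrence output stays nonzero through the last frame. This is the intuition behind pairing a local convolutional stage with a state-space stage for long sequences.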
Abstract
Continuous Emotion Recognition (CER) plays a crucial role in intelligent human-computer interaction, mental health monitoring, and autonomous driving. Emotion modeling based on the Valence-Arousal (VA) space enables a more nuanced representation of emotional states. However, existing methods still face challenges in handling long-term dependencies and capturing complex temporal dynamics. To address these issues, this paper proposes a novel emotion recognition model, Mamba-VA, which leverages the Mamba architecture to efficiently model sequential emotional variations across video frames. First, the model employs a Masked Autoencoder (MAE) to extract deep visual features from video frames, providing robust inputs for temporal modeling. Then, a Temporal Convolutional Network (TCN) captures local temporal dependencies. Subsequently, Mamba is applied for long-sequence modeling, enabling the model to learn global emotional trends. Finally, a fully connected (FC) layer performs regression to predict continuous valence and arousal values. Experimental results on the VA Estimation task of the 8th competition on Affective Behavior Analysis in-the-wild (ABAW) show that the proposed model achieves a valence score of 0.5362 on the validation set and 0.5036 on the test set, and an arousal score of 0.4310 on the validation set and 0.4119 on the test set, outperforming the baseline. The source code is available on GitHub: https://github.com/FreedomPuppy77/Charon.
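The abstract does not name the metric behind the reported scores. The ABAW VA Estimation challenge evaluates with the Concordance Correlation Coefficient (CCC), so the sketch below assumes the scores are CCC values. It is a plain-Python illustration with a hypothetical function name ccc, not code from the linked repository.

```python
from typing import Sequence


def ccc(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Concordance Correlation Coefficient between predictions and labels.

    CCC = 2*cov(pred, gold) / (var(pred) + var(gold) + (mean(pred) - mean(gold))**2)
    It lies in [-1, 1] and, unlike Pearson correlation, penalizes
    differences in mean and scale as well as poor correlation.
    """
    n = len(pred)
    if n == 0 or n != len(gold):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2.0 * cov / denom


# Illustrative values only, not data from the paper.
perfect = ccc([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])  # identical sequences
shifted = ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])  # perfectly correlated but biased
```

The shifted case shows why CCC is stricter than correlation: the two sequences have a Pearson correlation of 1, yet the constant offset lowers the CCC to 4/7.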
