MVP: Multimodal Emotion Recognition based on Video and Physiological Signals
Valeriya Strizhkova, Hadi Kachmar, Hava Chaptoukaev, Raphael Kalandadze, Natia Kukhilava, Tatia Tsmindashvili, Nibras Abo-Alzahab, Maria A. Zuluaga, Michal Balazia, Antitza Dantcheva, François Brémond, Laura Ferrari
TL;DR
The paper aims to improve emotion recognition by fusing facial video with physiological signals using MVP, a transformer-based architecture that can process long input sequences (1–2 minutes). It systematically compares video backbones (AU-based vs VideoMAE) and a physiological backbone, and proposes a mid-fusion cross-attention architecture to integrate the modalities. Evaluated on AMIGOS and DEAP, MVP achieves state-of-the-art results, outperforming prior unimodal and multimodal approaches, and ablations confirm the benefits of long-sequence processing and cross-attention fusion. Future work includes treating ECG and EDA as separate modalities and developing dedicated pre-training to further enhance performance and interpretability.
Abstract
Human emotions entail a complex set of behavioral, physiological and cognitive changes. Current state-of-the-art models fuse the behavioral and physiological components using classic machine learning, rather than recent deep learning techniques. We propose to fill this gap by designing the Multimodal for Video and Physio (MVP) architecture, streamlined to fuse video and physiological signals. Unlike other approaches, MVP exploits the benefits of attention to enable the use of long input sequences (1–2 minutes). We studied video and physiological backbones for inputting long sequences and evaluated our method against the state-of-the-art. Our results show that MVP outperforms previous methods for emotion recognition based on facial videos, EDA, and ECG/PPG.
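To make the idea of attention-based fusion of two token sequences concrete, the following is a minimal sketch of scaled dot-product cross-attention between video and physiological tokens. It is not the authors' implementation: the function names (`cross_attention`, `fuse`), the token counts, the embedding size, the bidirectional design, the mean pooling, and the absence of learned projections and multiple heads are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned Q/K/V projections, so keys and values are the
    same tokens. A real model would apply trained linear layers first.
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(keys_values))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, keys_values), weights


def mean_pool(tokens):
    """Average a token sequence into one vector."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def fuse(video_tokens, physio_tokens):
    """Hypothetical bidirectional fusion: each modality attends to the other,
    then both attended sequences are pooled and concatenated."""
    video_attended, _ = cross_attention(video_tokens, physio_tokens)
    physio_attended, _ = cross_attention(physio_tokens, video_tokens)
    return mean_pool(video_attended) + mean_pool(physio_attended)


# Illustrative random tokens: 12 video tokens and 20 physiological tokens,
# both already embedded in the same 8-dimensional space (an assumption).
random.seed(0)
dim = 8
video_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(12)]
physio_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(20)]

fused = fuse(video_tokens, physio_tokens)
attended, weights = cross_attention(video_tokens, physio_tokens)
```

In this sketch the two sequences may have different lengths, because attention weights are computed per query token over all tokens of the other modality. The fused vector could then be passed to a classifier head, which is omitted here.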
