MVP: Multimodal Emotion Recognition based on Video and Physiological Signals

Valeriya Strizhkova; Hadi Kachmar; Hava Chaptoukaev; Raphael Kalandadze; Natia Kukhilava; Tatia Tsmindashvili; Nibras Abo-Alzahab; Maria A. Zuluaga; Michal Balazia; Antitza Dantcheva; François Brémond; Laura Ferrari

TL;DR

The paper addresses improving emotion recognition by fusing facial video with physiological signals using a transformer-based MVP that can process long sequences (1–2 minutes). It systematically compares video backbones (AU-based vs VideoMAE) and a physiological backbone, proposing a mid-fusion cross-attention architecture to integrate modalities. Evaluated on AMIGOS and DEAP, MVP achieves state-of-the-art results over prior unimodal and multimodal approaches, with ablations confirming the benefits of long-sequence processing and cross-attention fusion. Future work includes separating ECG/EDA modalities and developing dedicated pre-training to further enhance performance and interpretability.

Abstract

Human emotions entail a complex set of behavioral, physiological and cognitive changes. Current state-of-the-art models fuse the behavioral and physiological components using classic machine learning rather than recent deep learning techniques. We propose to fill this gap by designing the Multimodal for Video and Physio (MVP) architecture, streamlined to fuse video and physiological signals. Unlike other approaches, MVP exploits the benefits of attention to enable the use of long input sequences (1-2 minutes). We study video and physiological backbones for long input sequences and evaluate our method against the state of the art. Our results show that MVP outperforms previous methods for emotion recognition based on facial videos, EDA, and ECG/PPG.

Paper Structure (24 sections, 1 equation, 5 figures, 4 tables)

Introduction
Related Work
Multimodal Emotion Recognition
Video cues.
Physiological cues.
Method
VideoMAE for Long Input Videos
Selected Backbones for Features Extraction
Vision backbone.
Physiological backbone.
MVP Architecture
Data Handling
Experiments
Emotion Recognition Labels
Datasets and Pre-processing
...and 9 more sections

Figures (5)

Figure 1: Multimodal Emotion Recognition based on Video and Physiological Signals (MVP) architecture. Video features and raw physiological signals are input as full long sequences into the model to predict binary valence and arousal. The cross-attention transformer is used to fuse multiple modalities.
Figure 2: Pre-training and fine-tuning steps of VideoMAE in emotion recognition. In the pre-training step, the autoencoder reconstructs the masked input video. During fine-tuning, the pre-trained encoder is fine-tuned to predict binarized valence and arousal from the original unmasked videos.
Figure 3: Label distribution of AMIGOS and DEAP datasets.
Figure 4: Crop comparison on the AMIGOS dataset. The larger crop X captures the entire face and a bit of background; the smaller crop Y captures exclusively the face. Larger crops give a higher F1-score.
Figure 5: Success cases for arousal and valence prediction on the DEAP dataset.
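
The Figure 1 caption mentions a cross-attention transformer that fuses the two modalities. As a rough illustration of that idea only, the NumPy sketch below computes single-head cross-attention between two feature sequences of different lengths. It is not the authors' implementation: the choice of video features as queries and physiological features as keys and values, the sequence lengths, the feature sizes, and the random projection matrices are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_seq, context_seq, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    query_seq:   (T_q, d_q) features of the querying modality
    context_seq: (T_c, d_c) features of the attended modality
    Returns the fused sequence (T_q, d_model) and the attention map (T_q, T_c).
    """
    q = query_seq @ w_q
    k = context_seq @ w_k
    v = context_seq @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Hypothetical shapes: 120 video time steps, 300 physiological time steps.
rng = np.random.default_rng(0)
d_video, d_physio, d_model = 16, 8, 12
video_feats = rng.normal(size=(120, d_video))
physio_feats = rng.normal(size=(300, d_physio))

# Random projections stand in for learned weights (assumption).
w_q = rng.normal(size=(d_video, d_model)) * 0.1
w_k = rng.normal(size=(d_physio, d_model)) * 0.1
w_v = rng.normal(size=(d_physio, d_model)) * 0.1

fused, attn = cross_attention(video_feats, physio_feats, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Each row of `attn` is a distribution over the physiological time steps, so every video time step receives a weighted summary of the whole physiological sequence. This is what lets the two modalities have different lengths and sampling rates.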