MVP: Multimodal Emotion Recognition based on Video and Physiological Signals

Valeriya Strizhkova, Hadi Kachmar, Hava Chaptoukaev, Raphael Kalandadze, Natia Kukhilava, Tatia Tsmindashvili, Nibras Abo-Alzahab, Maria A. Zuluaga, Michal Balazia, Antitza Dantcheva, François Brémond, Laura Ferrari

TL;DR

The paper addresses improving emotion recognition by fusing facial video with physiological signals using a transformer-based MVP that can process long sequences (1–2 minutes). It systematically compares video backbones (AU-based vs VideoMAE) and a physiological backbone, proposing a mid-fusion cross-attention architecture to integrate modalities. Evaluated on AMIGOS and DEAP, MVP achieves state-of-the-art results over prior unimodal and multimodal approaches, with ablations confirming the benefits of long-sequence processing and cross-attention fusion. Future work includes separating ECG/EDA modalities and developing dedicated pre-training to further enhance performance and interpretability.

Abstract

Human emotions entail a complex set of behavioral, physiological and cognitive changes. Current state-of-the-art models fuse the behavioral and physiological components using classic machine learning, rather than recent deep learning techniques. We propose to fill this gap by designing the Multimodal for Video and Physio (MVP) architecture, streamlined to fuse video and physiological signals. Unlike other approaches, MVP exploits the benefits of attention to enable the use of long input sequences (1-2 minutes). We have studied video and physiological backbones for inputting long sequences and evaluated our method against the state of the art. Our results show that MVP outperforms previous methods for emotion recognition based on facial videos, EDA, and ECG/PPG.
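
To make the cross-attention fusion mentioned in the TL;DR more concrete, the following is a minimal sketch of scaled dot-product cross-attention in plain Python, in which tokens from one modality act as queries over tokens from the other. It is an illustration under stated assumptions, not the authors' implementation: the choice of video tokens as queries, the absence of learned projections and multiple heads, and the names `cross_attention`, `video_tokens`, and `physio_tokens` are all hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, no learned weights).

    queries: list of T_q vectors of dimension d (assumed: video tokens)
    keys, values: lists of T_k vectors (assumed: physiological tokens)
    Returns T_q fused vectors, each a weighted average of `values`.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in queries:
        # Similarity of this query token to every key token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Mix the value tokens according to the attention weights.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Hypothetical toy inputs: 2 video tokens attend over 3 physiological tokens.
video_tokens = [[1.0, 0.0], [0.0, 1.0]]
physio_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(video_tokens, physio_tokens, physio_tokens)
```

Each output token has the same length as a value token, and the number of output tokens equals the number of query tokens. A full model would typically add learned query, key, and value projections, multiple heads, and residual connections around this operation.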

Paper Structure (24 sections, 1 equation, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Multimodal Emotion Recognition based on Video and Physiological Signals (MVP) architecture. Video features and raw physiological signals are input as full long sequences into the model to predict binary valence and arousal. The cross-attention transformer is used to fuse multiple modalities.
  • Figure 2: Pre-training and fine-tuning steps of VideoMAE for emotion recognition. In the pre-training step, the autoencoder reconstructs the masked input video. During fine-tuning, the pre-trained encoder is tuned to predict binarized valence and arousal from the original, unmasked videos.
  • Figure 3: Label distribution of AMIGOS and DEAP datasets.
  • Figure 4: Crop comparison on the AMIGOS dataset. The larger crop X captures the entire face and some background, whereas the smaller crop Y captures only the face. Larger crops give a higher F1-score.
  • Figure 5: Success cases for arousal and valence prediction on the DEAP dataset.