Dynamic Modality and View Selection for Multimodal Emotion Recognition with Missing Modalities

Luciana Trinkaus Menon, Luiz Carlos Ribeiro Neduziak, Jean Paul Barddal, Alessandro Lameiras Koerich, Alceu de Souza Britto

TL;DR

This work tackles multimodal emotion recognition under missing modalities by comparing a dynamic ensemble selection approach with a cross-attention fusion model on the RECOLA dataset. It introduces a dynamic selection framework with four variants (DS, DW, DWS, and Meta-DW) and evaluates it against a cross-attention architecture that fuses audio and video cues for continuous arousal and valence prediction. Using k-fold cross-validation and modality-absence simulations (zero and mean vector replacements), the study shows that dynamic selection methods consistently outperform the baseline when a modality is missing, while cross-attention offers robustness in certain missing-modality scenarios. The results indicate that audio primarily drives arousal, whereas video more strongly informs valence, and they demonstrate the practical viability of adaptive modality handling in real-world MER systems.

Abstract

The study of human emotions, traditionally a cornerstone in fields like psychology and neuroscience, has been profoundly impacted by the advent of artificial intelligence (AI). Multiple channels, such as speech (voice) and facial expressions (image), are crucial in understanding human emotions. However, AI's journey in multimodal emotion recognition (MER) is marked by substantial technical challenges. One significant hurdle is how AI models manage the absence of a particular modality - a frequent occurrence in real-world situations. This study's central focus is assessing the performance and resilience of two strategies when confronted with the lack of one modality: a novel multimodal dynamic modality and view selection method and a cross-attention mechanism. Results on the RECOLA dataset show that dynamic selection-based methods are a promising approach for MER. In the missing-modality scenarios, all dynamic selection-based methods outperformed the baseline. The study concludes by emphasizing the intricate interplay between audio and video modalities in emotion prediction, showcasing the adaptability of dynamic selection methods in handling missing modalities.
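
The zero and mean vector replacements mentioned in the TL;DR can be illustrated with a minimal sketch. It is not the authors' code: the function name `mask_modality`, the array shapes, and the choice to compute the mean vector from training-set features are all assumptions made for illustration.

```python
import numpy as np

def mask_modality(features, train_features=None, strategy="zero"):
    """Return a stand-in for a missing modality's feature matrix.

    features:       (n_frames, n_dims) features of the modality to drop.
    train_features: (n_train, n_dims) training features of the same modality;
                    only needed for the "mean" strategy (assumed source of the mean).
    strategy:       "zero" fills with zeros, "mean" repeats the training mean vector.
    """
    if strategy == "zero":
        return np.zeros_like(features)
    if strategy == "mean":
        if train_features is None:
            raise ValueError("mean replacement needs train_features")
        mean_vec = train_features.mean(axis=0)
        # Repeat the same mean vector for every frame of the missing modality.
        return np.tile(mean_vec, (features.shape[0], 1))
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical toy data: 2 training frames and 3 test frames, 2 feature dims.
train_audio = np.array([[1.0, 3.0], [3.0, 5.0]])
test_audio = np.array([[9.0, 9.0], [8.0, 8.0], [7.0, 7.0]])

audio_zero = mask_modality(test_audio, strategy="zero")
audio_mean = mask_modality(test_audio, train_audio, strategy="mean")
```

The masked matrix keeps the original shape, so it can be fed to an already trained model in place of the real features without changing the model's input layout.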

Paper Structure (12 sections, 10 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Dynamic selection approach based on three steps: multimodal feature extraction, training, and testing. The audio features include acoustic features, MFCCs, and Mel spectrograms. The video features include appearance features and geometric features. All regressors are trained separately, and a pool of regressors is obtained. The models are evaluated in the dynamic ensemble selection (DES) phase, where each model receives a weight to evaluate each test case according to its assertiveness in the competence zone. The figure also illustrates the Dynamic Weighting Selection (DWS) calculation.
  • Figure 2: Arousal gold standard and prediction of models based on acoustic features, MFCCs, Mel spectrograms, appearance features, and geometric features. The image was generated using test case T2 (second person from the test set) of the second cross-validation fold (k=2) and in the scenario where a zero vector represents the absence of a modality. From top to bottom: prediction with all active modalities, prediction with the absence of audio modality, and prediction with the absence of video modality.
  • Figure 3: Comparison of the arousal gold standard with the predictions of the mean of the regressors' outputs and of the dynamic selection-based methods (DS, DW, DWS, Meta-DW), shown with all active modalities, with the absence of audio modality, and with the absence of video modality. The image was generated using test case T2 (second person from the test set) of the second cross-validation fold (k=2) and in the scenario where a zero vector represents the absence of a modality.
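
The Figure 1 caption describes each regressor in the pool receiving a weight for each test case based on how well it performs in the competence zone. The sketch below shows one generic way such a dynamic weighting step could look; it is not the authors' formulation. The k-nearest-neighbour competence region, the inverse mean-absolute-error weights, the single shared feature space used for the neighbour search, and every name in the code are assumptions.

```python
import numpy as np

def dynamic_weighted_prediction(x_query, X_val, y_val, val_preds, query_preds,
                                k=5, eps=1e-8):
    """Combine a pool of regressors with weights computed per test sample.

    x_query:     (d,) features of the test sample (assumed shared feature space).
    X_val:       (n_val, d) validation features defining the competence region.
    y_val:       (n_val,) gold-standard values for the validation samples.
    val_preds:   (n_models, n_val) each regressor's validation predictions.
    query_preds: (n_models,) each regressor's prediction for the test sample.
    """
    # Competence region: the k validation samples closest to the query.
    dists = np.linalg.norm(X_val - x_query, axis=1)
    neighbours = np.argsort(dists)[:k]
    # Local mean absolute error of each regressor inside that region.
    local_err = np.abs(val_preds[:, neighbours] - y_val[neighbours]).mean(axis=1)
    # Lower local error gives a higher weight; weights are normalised to sum to 1.
    weights = 1.0 / (local_err + eps)
    weights /= weights.sum()
    return float(weights @ query_preds), weights


# Hypothetical toy pool: model 0 is accurate near x=0, model 1 near x=5.
X_val = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
y_val = np.array([1.0, 1.0, 1.0, 3.0, 3.0])
val_preds = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
                      [2.0, 2.0, 2.0, 3.0, 3.0]])
query_preds = np.array([1.0, 2.0])

pred, weights = dynamic_weighted_prediction(
    np.array([0.05]), X_val, y_val, val_preds, query_preds, k=3)
```

For a query near x=0, almost all of the weight goes to model 0, so the combined prediction follows that model. In a missing-modality setting, the intuition is that regressors fed a zero or mean replacement would tend to show higher local error and therefore receive lower weight, although how the paper's DS, DW, DWS, and Meta-DW variants actually compute and apply the weights is not stated in this summary.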