Dynamic Modality and View Selection for Multimodal Emotion Recognition with Missing Modalities
Luciana Trinkaus Menon, Luiz Carlos Ribeiro Neduziak, Jean Paul Barddal, Alessandro Lameiras Koerich, Alceu de Souza Britto
TL;DR
This work tackles multimodal emotion recognition (MER) under missing modalities by comparing a dynamic ensemble selection approach with a cross-attention fusion model on the RECOLA dataset. It introduces a dynamic selection framework with four variants (DS, DW, DWS, and Meta-DW) and evaluates it alongside a cross-attention architecture that fuses audio and video cues for continuous arousal and valence prediction. Using k-fold cross-validation and modality-absence simulations (zero and mean vector replacements), the study shows that all dynamic selection methods outperform the baseline when a modality is missing, while cross-attention offers robustness in certain missing-modality scenarios. The results indicate that audio primarily drives arousal, whereas video more strongly informs valence, and they demonstrate the practical viability of adaptive modality handling in real-world MER systems.
Abstract
The study of human emotions, traditionally a cornerstone in fields like psychology and neuroscience, has been profoundly impacted by the advent of artificial intelligence (AI). Multiple channels, such as speech (voice) and facial expressions (image), are crucial to understanding human emotions. However, multimodal emotion recognition (MER) still faces substantial technical challenges. One significant hurdle is how AI models handle the absence of a particular modality, a frequent occurrence in real-world situations. This study assesses the performance and resilience of two strategies when one modality is missing: a novel multimodal dynamic modality and view selection method, and a cross-attention mechanism. Results on the RECOLA dataset show that dynamic selection-based methods are a promising approach for MER. In the missing-modality scenarios, all dynamic selection-based methods outperformed the baseline. The study concludes by emphasizing the intricate interplay between audio and video modalities in emotion prediction and by showcasing the adaptability of dynamic selection methods in handling missing modalities.
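The TL;DR mentions simulating an absent modality by replacing its features with a zero vector or a mean vector. The sketch below illustrates one way such a replacement could be done. It is not the authors' code: the function name `simulate_missing`, the toy feature dimensions, and the choice to compute the mean over training data are all assumptions made for illustration.

```python
import numpy as np


def simulate_missing(features, strategy="zero", train_mean=None):
    """Return a stand-in for an absent modality's feature matrix.

    features   : array of shape (n_frames, n_features) for the modality
                 that is treated as missing (only its shape is used).
    strategy   : "zero" fills with zeros, "mean" fills every frame with
                 the same mean vector.
    train_mean : mean feature vector of shape (n_features,); assumed here
                 to come from the training split (hypothetical choice).
    """
    features = np.asarray(features, dtype=float)
    if strategy == "zero":
        return np.zeros_like(features)
    if strategy == "mean":
        if train_mean is None:
            raise ValueError("mean replacement requires train_mean")
        # Repeat the mean vector for every frame.
        return np.broadcast_to(train_mean, features.shape).copy()
    raise ValueError(f"unknown strategy: {strategy!r}")


# Toy data standing in for audio features (dimensions are arbitrary).
rng = np.random.default_rng(0)
audio_train = rng.normal(size=(100, 4))
audio_test = rng.normal(size=(5, 4))
audio_mean = audio_train.mean(axis=0)

zero_filled = simulate_missing(audio_test, "zero")
mean_filled = simulate_missing(audio_test, "mean", audio_mean)
```

Either filled matrix would then be passed to the model in place of the real audio features, so that the same architecture can be evaluated with and without the modality.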
