Uncertainty-Aware Multimodal Emotion Recognition through Dirichlet Parameterization
Rémi Grzeczkowicz, Eric Soriano, Ali Janati, Miyu Zhang, Gerard Comas-Quiles, Victor Carballo Araruna, Aneesh Jonelagadda
TL;DR
The paper tackles uncertainty-aware multimodal emotion recognition on edge devices by fusing speech, text, and facial cues through a Dirichlet-evidence framework grounded in Dempster–Shafer theory. Each modality uses a lightweight backbone (Emotion2Vec for speech, DistilRoBERTa for text, and ResEmotNet for images), and the fusion operates directly on raw logits. Final probabilities are computed as $\hat{p}_k = \alpha_k / S$, where $S = \sum_i \alpha_i$ is the Dirichlet strength, and uncertainty is given by $u = K / S$, where $K$ is the number of classes. The approach achieves competitive accuracy with a compact footprint across five benchmarks (eNTERFACE05, MEAD, MELD, RAVDESS, CREMA-D), and a neutral-tolerant evaluation shows robust fallback behavior, such as mapping the unseen class contempt to neutral. The method supports missing modalities and is designed for privacy-preserving, real-world deployment, with practical implications for healthcare and human–computer interaction. Future work includes temporal modeling, additional modalities, personalization, and integration with speech-to-speech emotion recognition.
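As a rough illustration of the two formulas above, the following sketch turns one modality's logits into Dirichlet parameters and then into $\hat{p}_k$ and $u$. The summary does not state how logits are mapped to evidence, so the softplus mapping and the offset $\alpha_k = e_k + 1$ used here are assumptions borrowed from common evidential-learning practice, and the function names are hypothetical.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)); keeps evidence non-negative.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dirichlet_from_logits(logits):
    """Return (probabilities, uncertainty) for one modality.

    ASSUMPTION: evidence e_k = softplus(logit_k) and alpha_k = e_k + 1.
    The source only specifies p_k = alpha_k / S and u = K / S.
    """
    alphas = [softplus(z) + 1.0 for z in logits]
    strength = sum(alphas)                    # S = sum_i alpha_i
    probs = [a / strength for a in alphas]    # p_k = alpha_k / S
    uncertainty = len(alphas) / strength      # u = K / S
    return probs, uncertainty


# Hypothetical 4-class logits: one peaked, one nearly flat.
confident_probs, confident_u = dirichlet_from_logits([9.0, -3.0, -3.0, -3.0])
ambiguous_probs, ambiguous_u = dirichlet_from_logits([0.1, 0.0, -0.1, 0.0])
```

Under these assumptions, the peaked logits accumulate more total evidence and therefore yield a lower $u$ than the nearly flat ones, which is the behavior the uncertainty term is meant to capture.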
Abstract
In this work, we present a lightweight and privacy-preserving Multimodal Emotion Recognition (MER) framework designed for deployment on edge devices. To demonstrate the framework's versatility, our implementation uses three modalities: speech, text, and facial imagery. However, the system is fully modular and can be extended to support other modalities or tasks. Each modality is processed through a dedicated backbone optimized for inference efficiency: Emotion2Vec for speech, a ResNet-based model for facial expressions, and DistilRoBERTa for text. To reconcile uncertainty across modalities, we introduce a model- and task-agnostic fusion mechanism grounded in Dempster-Shafer theory and Dirichlet evidence. Operating directly on model logits, this approach captures predictive uncertainty without requiring additional training or joint distribution estimation, making it broadly applicable beyond emotion recognition. Validation on five benchmark datasets (eNTERFACE05, MEAD, MELD, RAVDESS, and CREMA-D) shows that our method achieves competitive accuracy while remaining computationally efficient and robust to ambiguous or missing inputs. Overall, the proposed framework emphasizes modularity, scalability, and real-world feasibility, paving the way toward uncertainty-aware multimodal systems for healthcare, human-computer interaction, and other emotion-informed applications.
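To make the fusion idea concrete, here is a minimal sketch of how per-modality opinions might be combined with a reduced form of Dempster's rule, as commonly used in evidential multi-view classification. The abstract does not give the exact combination rule, so this is an assumption rather than the authors' implementation. Each opinion is a pair (beliefs, uncertainty) with $b_k = (\alpha_k - 1)/S$ and $u = K/S$, so beliefs and uncertainty sum to one. The function names and the convention of marking a missing modality with `None` are hypothetical.

```python
from functools import reduce


def combine(op1, op2):
    """Reduced Dempster combination of two opinions (beliefs, uncertainty)."""
    b1, u1 = op1
    b2, u2 = op2
    agree = sum(x * y for x, y in zip(b1, b2))
    # Conflict: belief mass the two modalities assign to different classes.
    conflict = sum(b1) * sum(b2) - agree
    # norm >= u1 * u2 > 0 whenever both uncertainties are positive.
    norm = 1.0 - conflict
    beliefs = [(x * y + x * u2 + y * u1) / norm for x, y in zip(b1, b2)]
    return beliefs, (u1 * u2) / norm


def fuse(opinions):
    """Fuse the available opinions; None marks a missing modality (assumed convention)."""
    present = [op for op in opinions if op is not None]
    if not present:
        raise ValueError("at least one modality is required")
    beliefs, u = reduce(combine, present)
    k = len(beliefs)
    # alpha_k / S rewritten in terms of belief and uncertainty: b_k + u / K.
    probs = [b + u / k for b in beliefs]
    return probs, u


# Hypothetical 3-class opinions; the facial modality is missing.
speech = ([0.6, 0.1, 0.1], 0.2)
text = ([0.5, 0.2, 0.0], 0.3)
fused_probs, fused_u = fuse([speech, text, None])
speech_only_probs, speech_only_u = fuse([speech, None, None])
```

In this sketch, a missing modality is simply skipped, which is equivalent to combining with a vacuous opinion (zero belief, uncertainty one). When two modalities agree, the fused uncertainty drops below either individual uncertainty.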
