Uncertainty-Aware Multimodal Emotion Recognition through Dirichlet Parameterization

Rémi Grzeczkowicz, Eric Soriano, Ali Janati, Miyu Zhang, Gerard Comas-Quiles, Victor Carballo Araruna, Aneesh Jonelagadda

TL;DR

The paper tackles uncertainty-aware multimodal emotion recognition on edge devices by fusing speech, text, and facial cues through a Dirichlet-evidence framework grounded in Dempster–Shafer theory. Each modality uses a lightweight backbone (Emotion2Vec for speech, DistilRoBERTa for text, and ResEmotNet for images), and the fusion operates directly on raw logits. Final probabilities are computed as $\hat{p}_k = \alpha_k / S$, where $\alpha_k$ is the Dirichlet parameter for class $k$ and $S = \sum_i \alpha_i$ is the Dirichlet strength; uncertainty is $u = K / S$, where $K$ is the number of emotion classes. The approach achieves competitive accuracy with a compact footprint across five benchmarks (eNTERFACE05, MEAD, MELD, RAVDESS, CREMA-D). A neutral-tolerant evaluation shows robust fallback behavior: for example, the unseen emotion contempt is mapped to neutral. The method supports missing modalities and is designed for privacy-preserving, real-world deployment, with practical implications for healthcare and human–computer interaction. Future work includes temporal modeling, additional modalities, personalization, and integration with speech-to-speech emotion recognition.
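The two formulas above can be illustrated with a minimal Python sketch. The summary does not say how logits are turned into evidence, so the softplus mapping and the $\alpha_k = e_k + 1$ convention used here are assumptions borrowed from common evidential-learning practice, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); always non-negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dirichlet_from_logits(logits: Sequence[float]) -> Tuple[List[float], float]:
    """Return (expected probabilities, uncertainty) for one modality.

    ASSUMPTION: evidence e_k = softplus(logit_k) and alpha_k = e_k + 1.
    The paper may use a different logit-to-evidence mapping.
    """
    alpha = [softplus(z) + 1.0 for z in logits]
    strength = sum(alpha)                      # S = sum_i alpha_i
    num_classes = len(alpha)                   # K
    probs = [a / strength for a in alpha]      # p_k = alpha_k / S
    uncertainty = num_classes / strength       # u = K / S
    return probs, uncertainty


if __name__ == "__main__":
    probs, u = dirichlet_from_logits([4.0, 0.5, -1.0])
    print([round(p, 3) for p in probs], round(u, 3))
```

Under these assumptions, logits that carry little evidence push every $\alpha_k$ toward 1, so $S$ approaches $K$ and $u$ approaches 1; strong evidence for any class increases $S$ and drives $u$ toward 0.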

Abstract

In this work, we present a lightweight and privacy-preserving Multimodal Emotion Recognition (MER) framework designed for deployment on edge devices. To demonstrate the framework's versatility, our implementation uses three modalities: speech, text, and facial imagery. However, the system is fully modular and can be extended to support other modalities or tasks. Each modality is processed through a dedicated backbone optimized for inference efficiency: Emotion2Vec for speech, a ResNet-based model for facial expressions, and DistilRoBERTa for text. To reconcile uncertainty across modalities, we introduce a model- and task-agnostic fusion mechanism grounded in Dempster-Shafer theory and Dirichlet evidence. Operating directly on model logits, this approach captures predictive uncertainty without requiring additional training or joint distribution estimation, making it broadly applicable beyond emotion recognition. Validation on five benchmark datasets (eNTERFACE05, MEAD, MELD, RAVDESS, and CREMA-D) shows that our method achieves competitive accuracy while remaining computationally efficient and robust to ambiguous or missing inputs. Overall, the proposed framework emphasizes modularity, scalability, and real-world feasibility, paving the way toward uncertainty-aware multimodal systems for healthcare, human-computer interaction, and other emotion-informed applications.
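To make the fusion idea concrete, the sketch below combines per-modality Dirichlet parameters with the reduced Dempster combination rule often used in evidential multi-view classification, and simply skips modalities that are missing. The abstract does not spell out the exact rule, so this is an assumed stand-in rather than the authors' method; the function names and the convention that a missing modality is passed as `None` are hypothetical.

```python
from functools import reduce
from typing import Dict, List, Optional, Sequence, Tuple

Opinion = Tuple[List[float], float]  # (belief masses b_k, uncertainty u)


def opinion_from_alpha(alpha: Sequence[float]) -> Opinion:
    """Convert Dirichlet parameters to belief masses, so that sum(b) + u = 1."""
    strength = sum(alpha)
    belief = [(a - 1.0) / strength for a in alpha]  # b_k = (alpha_k - 1) / S
    return belief, len(alpha) / strength            # u = K / S


def combine(op1: Opinion, op2: Opinion) -> Opinion:
    """Reduced Dempster rule for two opinions (ASSUMED fusion rule)."""
    b1, u1 = op1
    b2, u2 = op2
    agreement = sum(x * y for x, y in zip(b1, b2))
    conflict = sum(b1) * sum(b2) - agreement        # mass on disagreeing classes
    scale = 1.0 / (1.0 - conflict)                  # renormalize after conflict
    belief = [scale * (x * y + x * u2 + y * u1) for x, y in zip(b1, b2)]
    return belief, scale * u1 * u2


def fuse(alphas: Dict[str, Optional[Sequence[float]]]) -> Tuple[List[float], float]:
    """Fuse all available modalities; entries set to None are skipped."""
    present = [opinion_from_alpha(a) for a in alphas.values() if a is not None]
    if not present:
        raise ValueError("at least one modality is required")
    belief, u = reduce(combine, present)
    strength = len(belief) / u                      # recover S from u = K / S
    probs = [(b * strength + 1.0) / strength for b in belief]  # alpha_k / S
    return probs, u


if __name__ == "__main__":
    # Hypothetical Dirichlet parameters for three classes; face is missing.
    example = {"speech": [9.0, 1.0, 2.0], "text": [6.0, 2.0, 1.0], "face": None}
    print(fuse(example))
```

Because the rule works only on belief masses and uncertainties, it needs no extra training and no joint distribution over modalities, which matches the properties claimed in the abstract. With a single available modality the function returns that modality's own probabilities and uncertainty unchanged.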

Paper Structure (20 sections, 15 equations, 3 figures)

Figures (3)

  • Figure 1: Accuracy with and without neutral tolerance across datasets, using basic and advanced mitigation.
  • Figure 2: Prediction of contempt, an unseen emotion, using our advanced mitigation.
  • Figure 3: Confusion matrices for emotion prediction on CREMA-D.