A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Soumya Dutta; Smruthi Balaji; Sriram Ganapathy

TL;DR

This work proposes Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in Emotion Recognition in Conversations (ERC): modality-specific context modeling and multimodal information fusion.

Abstract

Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.
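
The gating and consistency ideas in the abstract can be illustrated with a minimal sketch. It assumes the gate produces one score per expert that is softmax-normalized, and that the KL-based regularizer is an average of symmetric pairwise KL terms between expert predictions. Both are illustrative assumptions, not the formulation used in the paper, and the function names `gate_experts` and `consistency_penalty` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_experts(expert_probs, gate_logits):
    """Mix per-expert class distributions using softmax gate weights.

    expert_probs: one class distribution per expert
                  (speech-only, text-only, cross-modal).
    gate_logits:  one unnormalized gate score per expert (assumed form).
    """
    weights = softmax(gate_logits)
    n_classes = len(expert_probs[0])
    mixed = [
        sum(w * probs[c] for w, probs in zip(weights, expert_probs))
        for c in range(n_classes)
    ]
    return weights, mixed


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_penalty(expert_probs):
    """Mean symmetric KL over all expert pairs (hypothetical regularizer)."""
    total, pairs = 0.0, 0
    for i in range(len(expert_probs)):
        for j in range(i + 1, len(expert_probs)):
            total += kl_divergence(expert_probs[i], expert_probs[j])
            total += kl_divergence(expert_probs[j], expert_probs[i])
            pairs += 1
    return total / pairs


# Toy utterance with three emotion classes and made-up expert outputs.
speech = [0.7, 0.2, 0.1]
text = [0.1, 0.8, 0.1]
cross = [0.3, 0.6, 0.1]
weights, mixed = gate_experts([speech, text, cross], [0.2, 1.0, 0.5])
penalty = consistency_penalty([speech, text, cross])
```

Because the gate weights sum to one, the mixed output remains a valid probability distribution, and the penalty shrinks to zero as the three experts agree.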

Paper Structure (31 sections, 16 equations, 9 figures, 3 tables)

Introduction
Related Work
Proposed Method
Problem Description
Unimodal Feature Extraction
Conversational Modeling
Context Addition Network
Multimodal Network
Mixture-of-Experts Gating
Model Training
Loss Function
Multimodal Contrastive Loss
MoE Gating Loss
Total Loss
Experiments
...and 16 more sections

Figures (9)

Figure 1: (a) Training of the unimodal feature extraction module (see "Unimodal Feature Extraction"). (b) The entire pipeline of MiSTER-E. The speech and text embedding modules are frozen during training of the rest of the pipeline. Two context addition networks (see "Context Addition Network") are trained for the two modalities along with a multimodal network (see "Multimodal Network"). Finally, a mixture of experts gating network (see "Mixture-of-Experts Gating") is trained to predict the emotion category for each utterance.
Figure 2: (a) The context addition network (for the speech modality) and (b) the multimodal network used in MiSTER-E. The inputs to both the blocks are derived from the uni-modal feature extractor modules. TIN stands for Temporal Inception Network, MHA stands for multi-head attention.
Figure 3: Performance of MiSTER-E with different values of the focal loss hyperparameter.
Figure 4: Comparison of model performance on IEMOCAP and MELD. (a) Effect of different MoE gating strategies in MiSTER-E. (b) Performance of unimodal SLLMs and LLMs across the two datasets.
Figure 5: (a) Distribution of the gating weights assigned to the experts for the different datasets and (b) the performance of the individual experts and MiSTER-E for the two datasets.
...and 4 more figures
