A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Soumya Dutta, Smruthi Balaji, Sriram Ganapathy

TL;DR

This work proposes Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in Emotion Recognition in Conversations (ERC): modality-specific context modeling and multimodal information fusion.

Abstract

Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weights their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves weighted F1-scores of 70.9%, 69.5%, and 87.9%, respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.
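To make the gating idea in the abstract concrete, the following is a minimal sketch of how class probabilities from three experts might be combined with softmax gate weights, together with one possible KL-based consistency penalty. It is an illustration under stated assumptions, not the authors' implementation: the function names (`gated_mixture`, `consistency_penalty`), the choice to measure KL divergence against the mean expert distribution, and the fixed example logits are all hypothetical. In the actual model the gate weights would be produced by a trained network for each utterance.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_mixture(expert_probs, gate_logits):
    """Combine per-expert class distributions using softmax gate weights.

    expert_probs: list of class-probability lists, one per expert.
    gate_logits:  one unnormalized score per expert (assumed to come
                  from a learned gating network in a real system).
    """
    weights = softmax(gate_logits)
    num_classes = len(expert_probs[0])
    mixed = [
        sum(w * probs[c] for w, probs in zip(weights, expert_probs))
        for c in range(num_classes)
    ]
    return weights, mixed


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_penalty(expert_probs):
    """Average KL of each expert from the mean expert distribution.

    This is one assumed form of a KL-based regularizer; the paper's
    exact formulation may differ.
    """
    n = len(expert_probs)
    num_classes = len(expert_probs[0])
    mean = [sum(p[c] for p in expert_probs) / n for c in range(num_classes)]
    return sum(kl_divergence(p, mean) for p in expert_probs) / n


# Hypothetical expert outputs for one utterance over four emotion classes.
speech = softmax([2.0, 0.5, 0.1, -1.0])
text = softmax([0.2, 1.8, 0.0, -0.5])
cross = softmax([1.0, 1.2, 0.1, -0.8])

weights, mixed = gated_mixture([speech, text, cross], gate_logits=[0.1, 0.7, 1.2])
penalty = consistency_penalty([speech, text, cross])
predicted_class = max(range(len(mixed)), key=lambda c: mixed[c])
```

Because the gate weights sum to one and each expert outputs a valid distribution, the mixture is itself a valid distribution. The penalty is zero when all experts agree and grows as their predictions diverge.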

Paper Structure (31 sections, 16 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: (a) Training of the unimodal feature extraction module (Sec. \ref{sec:unifeat}). (b) The entire pipeline of MiSTER-E. The speech and text embedding modules are frozen during training of the rest of the pipeline. Two context addition networks (Sec. \ref{sec:can}) are trained for the two modalities along with a multimodal network (Sec. \ref{sec:fusion}). Finally, a mixture-of-experts gating network (Sec. \ref{sec:moe}) is trained to predict the emotion category for each utterance.
  • Figure 2: (a) The context addition network (for the speech modality) and (b) the multimodal network used in MiSTER-E. The inputs to both blocks are derived from the unimodal feature extractor modules. TIN stands for Temporal Inception Network; MHA stands for multi-head attention.
  • Figure 3: Performance of MiSTER-E with different values of the focal loss hyperparameter.
  • Figure 4: Comparison of model performance on IEMOCAP and MELD. (a) Effect of different MoE gating strategies in MiSTER-E. (b) Performance of unimodal SLLMs and LLMs across the two datasets.
  • Figure 5: (a) Distribution of the expert weights for the different datasets and (b) the performance of the individual experts and MiSTER-E on the two datasets.
  • ...and 4 more figures