Unimodal-driven Distillation in Multimodal Emotion Recognition with Dynamic Fusion

Jiagen Li, Rui Yu, Huihao Huang, Huaicheng Yan

TL;DR

SUMMER tackles MERC by addressing modal heterogeneity and learning disorientation with three components: a Sparse Dynamic Mixture of Experts (SDMoE) for token-wise intra-modal routing, Hierarchical Cross-Modal Fusion (HCMF) for inter-modal integration, and Interactive Knowledge Distillation (IKD), which uses a unimodal teacher to guide multimodal learning. It introduces formulations for dynamic routing, cross-modal fusion, and knowledge distillation (KD) losses, including $L_{cross}^{KD}$, $L_{align}^{Label}$, and $L_{smooth}^{Label}$, combined as $L_{IKD} = \kappa_1 L_{cross}^{KD} + \kappa_2 L_{align}^{Label} + \kappa_3 L_{smooth}^{Label}$. Experiments on IEMOCAP and MELD show that SUMMER outperforms state-of-the-art methods, particularly for minority and semantically similar emotions, supporting the use of unimodal-driven guidance and dynamic fusion for MERC. The work has practical implications for dialogue systems and opinion analysis: it enables adaptive fusion across modalities with reduced learning disorientation and improved generalization.
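
The weighted combination of the three IKD terms can be sketched in a few lines. This is only an illustration of the sum $L_{IKD} = \kappa_1 L_{cross}^{KD} + \kappa_2 L_{align}^{Label} + \kappa_3 L_{smooth}^{Label}$, not the authors' implementation: the function name `combine_ikd_loss`, the default weights, and the numeric inputs are all assumptions made up for the example, and the individual loss terms are taken as already-computed scalars.

```python
def combine_ikd_loss(l_cross_kd, l_align_label, l_smooth_label,
                     kappa1=1.0, kappa2=1.0, kappa3=1.0):
    """Weighted sum of the three IKD loss terms.

    The kappa defaults are placeholders (an assumption); the source
    summary does not state the values used in the paper.
    """
    return (kappa1 * l_cross_kd
            + kappa2 * l_align_label
            + kappa3 * l_smooth_label)


# Hypothetical scalar loss values, chosen only to show the arithmetic.
total = combine_ikd_loss(1.0, 2.0, 3.0, kappa1=0.5, kappa2=0.25, kappa3=0.25)
print(total)  # 0.5*1.0 + 0.25*2.0 + 0.25*3.0 = 1.75
```

In a training loop the same expression would typically be applied to tensor-valued losses, with the $\kappa$ weights treated as hyperparameters.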

Abstract

Multimodal Emotion Recognition in Conversations (MERC) identifies emotional states across text, audio and video, which is essential for intelligent dialogue systems and opinion analysis. Existing methods emphasize heterogeneous modal fusion directly for cross-modal integration, but often suffer from disorientation in multimodal learning due to modal heterogeneity and lack of instructive guidance. In this work, we propose SUMMER, a novel heterogeneous multimodal integration framework leveraging Mixture of Experts with Hierarchical Cross-modal Fusion and Interactive Knowledge Distillation. Key components include a Sparse Dynamic Mixture of Experts (SDMoE) for capturing dynamic token-wise interactions, a Hierarchical Cross-Modal Fusion (HCMF) for effective fusion of heterogeneous modalities, and Interactive Knowledge Distillation (IKD), which uses a pre-trained unimodal teacher to guide multimodal fusion in latent and logit spaces. Experiments on IEMOCAP and MELD show SUMMER outperforms state-of-the-art methods, particularly in recognizing minority and semantically similar emotions.

Paper Structure

This paper contains 32 sections, 9 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: A representative example of multimodal emotion recognition in conversations from The Big Bang Theory.
  • Figure 2: Illustration of the SUMMER framework, where the frozen teacher model is dedicated to mentoring the student model by providing a comprehensive guide for learning with Interactive Knowledge Distillation.
  • Figure 3: (a) SDMoE comprises two main components: the Auxiliary Expert Network and the Dynamic Routing Mechanism. Specifically, the dynamic router adjusts the relevance of the attention map to facilitate local token-wise interactions. (b) HCMF integrates a multi-level hierarchical structure for cross-modal fusion to enhance overall contextual understanding.
  • Figure 4: Visualization of features where each point corresponds to an utterance, with colors denoting different emotions.