Emotion Recognition in Multi-Speaker Conversations through Speaker Identification, Knowledge Distillation, and Hierarchical Fusion

Xiao Li, Kotaro Funakoshi, Manabu Okumura

TL;DR

This work tackles emotion recognition in multi-speaker conversations by addressing speaker ambiguity and severe class imbalance through three innovations: LipSyncNet for active speaker identification via audio-visual synchronization, cross-modal knowledge distillation that transfers textual emotion understanding to audio and visual modalities using graph-based teachers and students, and a hierarchical fusion framework with a composite loss to robustly handle imbalanced data. The method combines contextual textual features from RoBERTa, audio embeddings from Wav2Vec2.0, and visual representations from TimeSformer, all aligned to the true speaker and fused through adaptive gates, cross-modal attention, mixture-of-experts (MoE) layers, and transformer encoders. Empirical results on MELD and IEMOCAP demonstrate state-of-the-art weighted F1 scores of $67.75\%$ and $72.44\%$, respectively, with clear gains on minority emotions and statistically significant improvements over strong baselines. The approach offers practical impact for real-world conversational AI by delivering robust, speaker-aware, multimodal emotion understanding with balanced performance across emotion categories.
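The summary does not spell out the distillation objective, so the following is only a minimal sketch of the generic soft-label formulation (temperature-softened KL divergence between teacher and student predictions), not the paper's graph-based teacher-student loss. The function names, the temperature value, and the example logits are all illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: this is the standard soft-label distillation loss, used here
    as a stand-in; the paper's actual objective may differ.
    """
    p = softmax(teacher_logits, temperature)  # teacher (e.g. text) distribution
    q = softmax(student_logits, temperature)  # student (e.g. audio) distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits over three emotion classes.
teacher = [2.0, 0.5, -1.0]
student = [0.2, 0.1, 0.0]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge, which is the signal that would push an audio or visual student toward the text teacher's predictions.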

Abstract

Emotion recognition in multi-speaker conversations faces significant challenges due to speaker ambiguity and severe class imbalance. We propose a novel framework that addresses these issues through three key innovations: (1) a speaker identification module that leverages audio-visual synchronization to accurately identify the active speaker, (2) a knowledge distillation strategy that transfers superior textual emotion understanding to audio and visual modalities, and (3) hierarchical attention fusion with composite loss functions to handle class imbalance. Comprehensive evaluations on MELD and IEMOCAP datasets demonstrate superior performance, achieving 67.75% and 72.44% weighted F1 scores respectively, with particularly notable improvements on minority emotion classes.
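For readers unfamiliar with the reported metric, weighted F1 averages the per-class F1 scores using each class's share of the true labels as its weight. The sketch below is a plain-Python illustration of that definition; the function name and the toy labels are assumptions and are not taken from the paper's evaluation code.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1  # weight by class frequency
    return score

# Toy example with an imbalanced label set (hypothetical data).
y_true = ["neutral", "neutral", "neutral", "joy", "anger"]
y_pred = ["neutral", "neutral", "joy", "joy", "neutral"]
score = weighted_f1(y_true, y_pred)
```

Because frequent classes carry more weight, a model can score well on weighted F1 while missing rare emotions entirely (the `anger` class above contributes zero), which is why per-class results on minority emotions are worth reporting alongside it.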

Paper Structure

This paper contains 36 sections, 22 equations, 3 figures, and 8 tables.

Figures (3)

  • Figure 1: Overall architecture of the proposed multimodal conversational emotion recognition system.
  • Figure 2: Confusion matrix for emotion classification on the MELD dataset, showing improved performance on minority emotion classes.
  • Figure 3: Confusion matrix for emotion classification on the IEMOCAP dataset, demonstrating balanced performance across emotion categories and clear separation between positive and negative emotional states.