Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion
Zanxu Wang, Homayoon Beigi
TL;DR
This work tackles data quality issues in multimodal emotion recognition in conversation by implementing a quality-control pipeline for MELD and IEMOCAP, ensuring speaker identity consistency, audio-text alignment, and reliable face tracking. It introduces a three-stage, identity-based transfer learning approach that leverages 512-d speaker and face embeddings, 768-d emotion-aware text embeddings via a fine-tuned MPNet-v2, and 128-d modality-specific refinements before fusion with a linear-time MAMBA block. On quality-controlled data, the system achieves 64.8% accuracy on MELD and 74.3% on IEMOCAP, demonstrating that carefully curated data and emotion-specific unimodal representations can yield competitive MERC performance, though fusion gains are modest and certain low-frequency emotions remain challenging. The study highlights the critical need for rigorous data curation and suggests future work in temporal modeling, self-supervised learning, and constructing high-quality, well-aligned multimodal datasets to advance practical MERC systems.
Abstract
This paper addresses data quality issues in multimodal emotion recognition in conversation (MERC) through systematic quality control and multi-stage transfer learning. We implement a quality control pipeline for the MELD and IEMOCAP datasets that validates speaker identity, audio-text alignment, and face detection. We leverage transfer learning from speaker and face recognition, assuming that identity-discriminative embeddings capture not only stable acoustic and facial traits but also person-specific patterns of emotional expression. We employ RecoMadeEasy(R) engines to extract 512-dimensional speaker and face embeddings, fine-tune MPNet-v2 for emotion-aware text representations, and adapt these features through emotion-specific MLPs trained on unimodal datasets. MAMBA-based trimodal fusion achieves 64.8% accuracy on MELD and 74.3% on IEMOCAP. These results show that combining identity-based audio and visual embeddings with emotion-tuned text representations on a quality-controlled subset of data yields consistently competitive performance for multimodal emotion recognition in conversation and provides a basis for further improvement on challenging, low-frequency emotion classes.
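To make the dimension flow described above concrete, the following is a minimal, untrained sketch in plain Python. It is not the authors' implementation. Only the 512-d, 768-d, and 128-d sizes come from the text; the hidden width of 256, the 7 output classes, the random weights, and every function name are assumptions made for illustration. The fusion step is a fixed-decay linear recurrence used as a stand-in for a linear-time sequence block, not an actual MAMBA block, whose state-space parameters are input-dependent and learned.

```python
import math
import random

random.seed(0)  # deterministic illustrative weights (nothing here is trained)

NUM_CLASSES = 7  # assumption: number of emotion labels
HIDDEN_DIM = 256  # assumption: MLP hidden width


def linear(in_dim, out_dim):
    """Return (weights, bias) for a randomly initialised linear layer."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[random.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def apply_linear(layer, x):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def make_mlp(in_dim, out_dim):
    """Two-layer MLP standing in for a modality-specific refinement."""
    return linear(in_dim, HIDDEN_DIM), linear(HIDDEN_DIM, out_dim)


def apply_mlp(mlp, x):
    first, second = mlp
    hidden = [max(0.0, v) for v in apply_linear(first, x)]  # ReLU
    return apply_linear(second, hidden)


def diagonal_scan(tokens, decay=0.9):
    """Stand-in fusion: h_t = decay * h_{t-1} + (1 - decay) * x_t.

    One pass over the token list, so the cost is linear in its length.
    """
    state = [0.0] * len(tokens[0])
    for token in tokens:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, token)]
    return state


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(speaker_emb, face_emb, text_emb, model):
    """Refine each modality to 128-d, fuse the three tokens, classify."""
    tokens = [
        apply_mlp(model["audio"], speaker_emb),   # 512 -> 128
        apply_mlp(model["visual"], face_emb),     # 512 -> 128
        apply_mlp(model["text"], text_emb),       # 768 -> 128
    ]
    fused = diagonal_scan(tokens)
    return softmax(apply_linear(model["head"], fused))


model = {
    "audio": make_mlp(512, 128),
    "visual": make_mlp(512, 128),
    "text": make_mlp(768, 128),
    "head": linear(128, NUM_CLASSES),
}

# Random vectors stand in for real speaker, face, and text embeddings.
speaker_emb = [random.gauss(0, 1) for _ in range(512)]
face_emb = [random.gauss(0, 1) for _ in range(512)]
text_emb = [random.gauss(0, 1) for _ in range(768)]

probs = classify(speaker_emb, face_emb, text_emb, model)
print(len(probs), round(sum(probs), 6))
```

Because the recurrence weights later tokens more heavily, the order in which the three modality tokens are scanned affects the fused vector. The order used here (audio, visual, text) is arbitrary and is not taken from the paper.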
