Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion

Zanxu Wang, Homayoon Beigi

TL;DR

This work tackles data quality issues in multimodal emotion recognition in conversation by implementing a quality-control pipeline for MELD and IEMOCAP, ensuring speaker identity consistency, audio-text alignment, and reliable face tracking. It introduces a three-stage, identity-based transfer learning approach that leverages 512-d speaker and face embeddings, 768-d emotion-aware text embeddings via a fine-tuned MPNet-v2, and 128-d modality-specific refinements before fusion with a linear-time MAMBA block. On quality-controlled data, the system achieves 64.8% accuracy on MELD and 74.3% on IEMOCAP, demonstrating that carefully curated data and emotion-specific unimodal representations can yield competitive MERC performance, though fusion gains are modest and certain low-frequency emotions remain challenging. The study highlights the critical need for rigorous data curation and suggests future work in temporal modeling, self-supervised learning, and constructing high-quality, well-aligned multimodal datasets to advance practical MERC systems.

Abstract

This paper addresses data quality issues in multimodal emotion recognition in conversation (MERC) through systematic quality control and multi-stage transfer learning. We implement a quality control pipeline for the MELD and IEMOCAP datasets that validates speaker identity, audio-text alignment, and face detection. We leverage transfer learning from speaker and face recognition, assuming that identity-discriminative embeddings capture not only stable acoustic and facial traits but also person-specific patterns of emotional expression. We employ RecoMadeEasy® engines to extract 512-dimensional speaker and face embeddings, fine-tune MPNet-v2 for emotion-aware text representations, and adapt these features through emotion-specific MLPs trained on unimodal datasets. MAMBA-based trimodal fusion achieves 64.8% accuracy on MELD and 74.3% on IEMOCAP. These results show that combining identity-based audio and visual embeddings with emotion-tuned text representations on a quality-controlled subset of data yields consistently competitive performance for multimodal emotion recognition in conversation and provides a basis for further improvement on challenging, low-frequency emotion classes.
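
The dimensions quoted in the TL;DR and in the Figure 1 caption (768-d text, 128-d speaker, 128-d face, 1024-d fused tokens) add up as they would under plain concatenation, since 768 + 128 + 128 = 1024. The sketch below only checks that arithmetic. Concatenation itself is an assumption inferred from the sizes, and `fuse_utterance` is a hypothetical helper name, not the authors' code. The MAMBA block, pooling, and classification head are not modeled.

```python
# Shape-level sketch of the fused token, ASSUMING plain concatenation.
# The per-modality sizes come from the paper summary; the helper is hypothetical.
TEXT_DIM, SPEAKER_DIM, FACE_DIM = 768, 128, 128
TOKEN_DIM = TEXT_DIM + SPEAKER_DIM + FACE_DIM  # 1024


def fuse_utterance(text_emb, speaker_emb, face_emb):
    """Concatenate the three per-utterance embeddings into one token vector."""
    expected = (
        ("text", text_emb, TEXT_DIM),
        ("speaker", speaker_emb, SPEAKER_DIM),
        ("face", face_emb, FACE_DIM),
    )
    for name, emb, dim in expected:
        if len(emb) != dim:
            raise ValueError(f"{name} embedding has {len(emb)} dims, expected {dim}")
    return [*text_emb, *speaker_emb, *face_emb]


# Placeholder vectors stand in for real embeddings.
token = fuse_utterance([0.0] * TEXT_DIM, [0.0] * SPEAKER_DIM, [0.0] * FACE_DIM)
print(len(token))  # 1024
```

A sequence of such tokens, one per utterance, would then be the input to the fusion block described in Stage 3.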

Paper Structure

This paper contains 16 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of the proposed three-stage quality-controlled multimodal emotion recognition pipeline. Stage 1 extracts foundation embeddings from RecoMadeEasy® speaker and face recognition engines [r-m:recotech] and a fine-tuned MPNet-based sentence transformer for MELD and IEMOCAP utterances. Stage 2 adapts the 512-dimensional audio and visual features into 128-dimensional emotion embeddings using modality-specific MLPs trained on unimodal emotion datasets. Stage 3 fuses 768-dimensional text, 128-dimensional speaker, and 128-dimensional facial embeddings into 1024-dimensional token representations and applies a single MAMBA block, pooling, and a linear classification head for utterance-level emotion prediction.
  • Figure 2: Distribution of emotion categories across auxiliary unimodal datasets. Top left: Visual datasets (CK+ and RAF-DB) showing 15,648 total samples. Top right: Audio datasets (CREMA-D, RAVDESS, SAVEE, and TESS) with 10,557 samples. Bottom: Text dataset composition showing balanced sampling from five sources (CrowdFlower, CARER, GoEmotions, ISEAR, and SemEval-2018) totaling 6.2k utterances across seven emotion categories.
  • Figure 3: Examples of quality control challenges and solutions in MELD dataset processing. (A) Speaker Facing Away: Utterances where the speaker's face is not visible (red box indicates detection failure) are removed to ensure reliable facial expression analysis. (B) Off-screen Speaker: Phone call scenes where the speaking character is not visible on camera are filtered out. (C) Successful Speaker Identification: Multi-person scenes where YOLOv8 detects multiple faces and Facenet-512 correctly identifies the target speaker (green boxes) based on embedding similarity across temporal frames (4.5s, 5.5s, 6.5s). (D) Audio-Text Misalignment: Example of detected misalignment between original transcript ("Yeah!") and Whisper transcription ("Yeah, it really has been great, too."), identified through low cosine similarity (0.30) and Levenshtein similarity (0.11) scores.
  • Figure 4: Distribution of audio-text alignment metrics across MELD splits. Left: Cosine similarity between original and Whisper-generated transcriptions using MPNet-v2 embeddings. Right: Levenshtein similarity measuring character-level differences. Red dashed lines indicate filtering thresholds (0.25 for cosine similarity, 0.3 for Levenshtein). Most utterances show high alignment, while removed samples cluster near low similarity scores.
  • Figure 5: Confusion matrices for trimodal fusion on the MELD test set (7 classes) and the IEMOCAP test set (4 classes).
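
The audio-text alignment check described in the Figure 3(D) and Figure 4 captions can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that Levenshtein similarity is defined as 1 minus the edit distance divided by the longer string's length (a normalization that reproduces the 0.11 reported for the "Yeah!" example), and that an utterance is flagged when either score falls below its threshold (the captions do not state how the two scores are combined). The cosine similarity is taken as a precomputed input, and the function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """ASSUMED normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def is_misaligned(cosine_sim: float, lev_sim: float,
                  cosine_threshold: float = 0.25,
                  lev_threshold: float = 0.3) -> bool:
    """Thresholds are from the Figure 4 caption; the 'or' rule is an ASSUMPTION."""
    return cosine_sim < cosine_threshold or lev_sim < lev_threshold


original = "Yeah!"
whisper = "Yeah, it really has been great, too."
lev = levenshtein_similarity(original, whisper)
print(round(lev, 2))             # 0.11
print(is_misaligned(0.30, lev))  # True under the assumed 'or' rule
```

Under these assumptions the Figure 3(D) example is flagged by the Levenshtein score alone, since its cosine similarity of 0.30 sits above the 0.25 threshold.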