
A Comprehensive Survey on Multi-modal Conversational Emotion Recognition with Deep Learning

Yuntao Shou, Tao Meng, Wei Ai, Fangze Fu, Nan Yin, Keqin Li

TL;DR

This survey examines deep-learning approaches to multi-modal conversational emotion recognition (MCER), organized around four modeling paradigms: context-free, sequential context, distinguishing-speaker, and speaker-relationship modeling. It reviews public datasets, feature extraction pipelines, and representative models (e.g., TFN, LFM, DialogueRNN, DialogueGCN, LR-GCN), compares their performance and trade-offs, and notes that graph-based speaker-relationship methods perform best on many benchmarks. The article also discusses applications across domains, privacy concerns, and practical challenges such as data scarcity, heterogeneity, and class imbalance. It outlines future directions including data generation, unbiased learning, zero-shot and multi-label settings, dynamic dialogue modeling, and efficient deployment. Overall, the work offers a technically oriented roadmap for advancing MCER in real-world systems with stronger cross-modal coherence and contextual awareness.

Abstract

Multi-modal conversational emotion recognition (MCER) aims to recognize and track a speaker's emotional state using the text, speech, and visual information in a conversation. Research on MCER is significant for affective computing, intelligent recommendation, and human-computer interaction. Unlike traditional single-utterance multi-modal emotion recognition or single-modal conversational emotion recognition, MCER is a more challenging problem because it must handle more complex emotional interaction relationships. The critical issue is learning consistent and complementary semantics for multi-modal feature fusion based on these emotional interaction relationships. To address this problem, researchers have studied MCER extensively using deep learning, but a systematic review of the modeling methods is still lacking. A timely and comprehensive overview of recent deep-learning advances in MCER is therefore of great significance to academia and industry. In this survey, we provide a comprehensive overview of MCER modeling methods and roughly divide them into four categories, i.e., context-free modeling, sequential context modeling, speaker-differentiated modeling, and speaker-relationship modeling. In addition, we discuss popular publicly available MCER datasets, multi-modal feature extraction methods, application areas, existing challenges, and future development directions. We hope that our review can help MCER researchers understand the current state of research in emotion recognition, provide some inspiration, and support the development of more efficient models.

Paper Structure (67 sections, 37 equations, 12 figures, 9 tables)


Figures (12)

  • Figure 1: An example from a multimodal conversational emotion recognition dataset, which contains three modalities: video, audio, and text. The task of MCER is to identify the emotion label of each speaker at the current moment (e.g., neutral, angry, surprised) based on the utterance content.
  • Figure 2: Illustration of the advantage of multimodal fusion in emotion recognition. (a) Example from the IEMOCAP dataset showing that textual modality alone may fail to capture emotional intent (“Neutral”), while audio and visual modalities correctly identify the emotion as “Sad”. (b) Latent space visualization of GS-MCC with unimodal (text-only) input shows overlapping clusters and poor separation between emotion classes. (c) The same visualization under multimodal fusion shows significantly improved class separability, demonstrating the effectiveness of incorporating audio-visual information.
  • Figure 3: A taxonomy of modeling approaches for multi-modal conversational emotion recognition. We divide existing MCER methods into four categories, i.e., context-free modeling, sequential context modeling, distinguishing-speaker modeling, and speaker-relationship modeling.
  • Figure 4: Timeline of multimodal conversational emotion recognition algorithms.
  • Figure 5: MCER methods mainly comprise multi-modal feature extraction, multi-modal emotion representation, and an emotion classifier.
  • ...and 7 more figures