Emotion Recognition and Generation: A Comprehensive Review of Face, Speech, and Text Modalities

Rebecca Mobbs, Dimitrios Makris, Vasileios Argyriou

TL;DR

The survey addresses the problem of understanding and generating human emotions across facial, vocal, and textual modalities. It surveys state-of-the-art methods in emotion recognition and generation, detailing preprocessing, datasets, architectures, and evaluation metrics, with emphasis on cross-modal integration and controllable generation. Key contributions include a comprehensive taxonomy of FER, SER, TSR, FEG, SEG, and TSG approaches, an analysis of evaluation frameworks, and a discussion of challenges such as data bias and ethical considerations. The work highlights the practical significance of robust, multimodal, and ethically responsible emotion-aware AI for applications in healthcare, customer service, and interactive agents, and outlines future directions including standardized benchmarks and multimodal fusion strategies.

Abstract

Emotion recognition and generation have emerged as crucial topics in Artificial Intelligence research, playing a significant role in enhancing human-computer interaction within healthcare, customer service, and other fields. Although several reviews have been conducted on emotion recognition and generation as separate entities, many of these works are either fragmented or limited to specific methodologies, lacking a comprehensive overview of recent developments and trends across different modalities. In this survey, we provide a holistic review aimed at researchers beginning their exploration in emotion recognition and generation. We introduce the fundamental principles underlying emotion recognition and generation across facial, vocal, and textual modalities. This work categorises recent state-of-the-art research into distinct technical approaches and explains the theoretical foundations and motivations behind these methodologies, offering a clearer understanding of their application. Moreover, we discuss evaluation metrics, comparative analyses, and current limitations, shedding light on the challenges faced by researchers in the field. Finally, we propose future research directions to address these challenges and encourage further exploration into developing robust, effective, and ethically responsible emotion recognition and generation systems.

Paper Structure

This paper contains 37 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: The EmoFAN pipeline integrates facial landmark detection, discrete emotion classification, and continuous valence-arousal estimation in a single neural network. This unified model performs all tasks in one pass, using a face-alignment network and an attention mechanism to focus on key facial regions, enhancing accuracy. Joint prediction of both emotion types, combined with knowledge distillation, improves robustness [toisoul2021estimation].
  • Figure 2: The SER model processes frame-level speech features as input, using a 2-layer LSTM to generate outputs aligned with each frame's corresponding time. The LSTM's internal forget gate has been replaced by an attention gate. To differentiate emotional nuances across time and feature dimensions, the model applies a weighting operation separately on the LSTM's output along both the time and feature dimensions. These two weighted outputs are then fed into fully connected layers, and the final output from the softmax layer provides the classification result [Xie2023].
  • Figure 3: The TER system by [Kumar2022] uses a BERT-based dual-channel pipeline for text emotion recognition. First, input sentences are converted into contextual embeddings with a pre-trained BERT model. These embeddings are then processed through two parallel channels: one uses CNN for feature extraction followed by BiLSTM for capturing sequence information, while the other uses BiLSTM first, followed by CNN. The outputs from both channels are concatenated and passed through dense layers for emotion classification. An explainability module further interprets the model's predictions by analysing emotion embedding clusters.
  • Figure 4: Instead of 3D modelling, EMO uses Stable Diffusion to generate new frames. The pipeline consists of a Backbone Network paired with a ReferenceNet to maintain identity consistency, audio-attention layers to synchronise facial expressions with audio tonalities, and temporal modules to ensure smooth transitions across frames. Weak control signals, such as a Face Locator and Speed Layers, provide loose guidance for facial positioning and movement velocity, achieving natural and stable head motions across clips [Tian2024emo].
  • Figure 5: The PromptVC pipeline uses a latent diffusion model to convert voice style from natural language prompts. During training, a style encoder extracts a global style vector from the input mel-spectrogram, while HuBERT-based discrete tokens capture linguistic content, refined by a differentiable duration predictor for accurate timing. A prosody encoder models phoneme-level prosody to enhance expressiveness. The latent diffusion model, conditioned on text embeddings, generates the style vector from noise, enabling flexible and precise style control [yao2024promptvc].
  • ...and 1 more figures
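
The separate time-wise and feature-wise weighting described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the cited authors' implementation: the function names, the dot-product scoring, and the score vectors (`time_scores_w`, `feature_scores_w`) are assumptions made here for illustration, and the LSTM outputs are represented as a plain list of per-frame feature lists.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def time_weighted(outputs, time_scores_w):
    """Weight frames: score each frame, softmax over time, sum frames.

    outputs: T frames, each a list of D features (hypothetical LSTM output).
    time_scores_w: assumed scoring vector of length D.
    Returns one vector of length D.
    """
    scores = [sum(h * w for h, w in zip(frame, time_scores_w)) for frame in outputs]
    alphas = softmax(scores)
    dims = len(outputs[0])
    return [sum(a * frame[d] for a, frame in zip(alphas, outputs)) for d in range(dims)]

def feature_weighted(outputs, feature_scores_w):
    """Weight features: score each feature, softmax over features, sum per frame.

    feature_scores_w: assumed scoring vector of length T.
    Returns one value per frame (length T).
    """
    dims = len(outputs[0])
    scores = [sum(frame[d] * w for frame, w in zip(outputs, feature_scores_w))
              for d in range(dims)]
    betas = softmax(scores)
    return [sum(b * x for b, x in zip(betas, frame)) for frame in outputs]

# Toy LSTM output with T=3 frames and D=2 features.
lstm_out = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
pooled_time = time_weighted(lstm_out, [0.0, 0.0])
pooled_feat = feature_weighted(lstm_out, [0.0, 0.0, 0.0])
# The two weighted outputs would then be passed to fully connected layers.
classifier_input = pooled_time + pooled_feat
```

With all-zero score vectors the weights are uniform, so each weighting reduces to a plain average along its axis. In a trained model the scoring parameters would be learned.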