Generative Technology for Human Emotion Recognition: A Scope Review

Fei Ma, Yucheng Yuan, Yifan Xie, Hongwei Ren, Ivan Liu, Ying He, Fuji Ren, Fei Richard Yu, Shiguang Ni

TL;DR

This survey addresses the lack of a systematic account of how generative models are used for emotion recognition. It covers autoencoders, GANs, diffusion models (DMs), and large language models (LLMs) across speech (SER), facial (FER), and textual (TER) emotion recognition, physiological signals, and multimodal emotion recognition (MER), and organizes the findings into a taxonomy covering data augmentation, feature extraction, semi-supervised learning, cross-domain transfer, and adversarial robustness. The paper makes four core contributions: (i) the first systematic review of generative technology for emotion recognition, (ii) an analysis of more than 320 papers with a modality-aware taxonomy and dataset benchmarking, (iii) a synthesis of practical insights and performance trends, and (iv) forward-looking guidance on combining diffusion models with transformers, RL/FL integration, VR/AR applications, and content synthesis. The findings indicate that FER currently benefits most from generative methods, that DMs and LLMs are still emerging in this area, and that cross-modal fusion and privacy are central practical concerns for real-world deployment.

Abstract

Affective computing stands at the forefront of artificial intelligence (AI), seeking to imbue machines with the ability to comprehend and respond to human emotions. Central to this field is emotion recognition, which aims to identify and interpret human emotional states from different modalities, such as speech, facial images, text, and physiological signals. In recent years, important progress has been made in generative models, including Autoencoders, Generative Adversarial Networks, Diffusion Models, and Large Language Models. With their powerful data generation capabilities, these models have emerged as pivotal tools for advancing emotion recognition. However, few systematic efforts have so far reviewed generative technology for emotion recognition. This survey aims to bridge that gap by comprehensively analyzing over 320 research papers published up to June 2024. Specifically, it first introduces the mathematical principles of different generative models and the commonly used datasets. Then, through a taxonomy, it provides an in-depth analysis of how generative techniques address emotion recognition across modalities in several respects, including data augmentation, feature extraction, semi-supervised learning, and cross-domain transfer, among others. Finally, the review outlines future research directions, emphasizing the potential of generative models to advance the field of emotion recognition and enhance the emotional intelligence of AI systems.

Paper Structure (27 sections, 5 equations, 14 figures, 11 tables)

Figures (14)

  • Figure 1: Schematic diagram of generative technology for emotion recognition.
  • Figure 2: Taxonomy of This Survey.
  • Figure 3: The pipeline of data augmentation from ma2022data: First, a generative model is employed to generate new data, then the original training set is combined with the generated data, creating an augmented training set for the emotion classification task.
  • Figure 4: The framework designed for extracting speech features from Zhang2021: It introduces instance normalization and an emotion embedding path to guide the AE in learning a priori knowledge from the label, enabling it to distinguish the most emotion-related features. The latent representation learned by the AE, enhanced with self-attention, is concatenated with acoustic features obtained using the openSMILE toolkit, and the resulting feature vector is then used for emotion classification.
  • Figure 5: A semi-supervised learning framework in SER from Zhao2020: A generator creates synthetic audio descriptors from noise. These descriptors, along with real ones from the openSMILE toolkit, are fed to a discriminator. The discriminator is trained with supervised and unsupervised loss functions to better distinguish real from fake audio cues.
  • ...and 9 more figures
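
The data augmentation pipeline described in the Figure 3 caption (generate new data, merge it with the original training set, train the emotion classifier on the result) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the cited work: a per-class Gaussian stands in for the generative model (a real system would train an AE, GAN, or diffusion model), and the function names, toy features, and labels are all hypothetical.

```python
import random
from statistics import mean, pstdev


def fit_class_gaussians(features, labels):
    """Fit an independent Gaussian per feature dimension for each emotion label.

    ASSUMPTION: this toy model stands in for a trained generative model.
    """
    rows_by_label = {}
    for x, y in zip(features, labels):
        rows_by_label.setdefault(y, []).append(x)
    params = {}
    for y, rows in rows_by_label.items():
        columns = list(zip(*rows))  # transpose: one tuple per feature dimension
        params[y] = [(mean(c), pstdev(c)) for c in columns]
    return params


def generate(params, n_per_class, rng):
    """Sample n_per_class synthetic feature vectors for every label."""
    synth_x, synth_y = [], []
    for y, dims in params.items():
        for _ in range(n_per_class):
            synth_x.append([rng.gauss(mu, sigma) for mu, sigma in dims])
            synth_y.append(y)
    return synth_x, synth_y


def augment(features, labels, n_per_class, seed=0):
    """Return the original training set followed by the generated samples."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    params = fit_class_gaussians(features, labels)
    synth_x, synth_y = generate(params, n_per_class, rng)
    return features + synth_x, labels + synth_y


# Hypothetical two-dimensional features for two emotion classes.
train_x = [[0.1, 0.2], [0.0, 0.3], [0.9, 0.8], [1.0, 0.7]]
train_y = ["sad", "sad", "happy", "happy"]

aug_x, aug_y = augment(train_x, train_y, n_per_class=3)
print(len(aug_x), aug_y.count("happy"), aug_y.count("sad"))
```

The augmented set `(aug_x, aug_y)` would then be passed to whatever emotion classifier is being trained. Keeping the original samples first and unmodified makes it easy to compare a classifier trained with and without the generated data.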