A Review of Human Emotion Synthesis Based on Generative Technology

Fei Ma, Yukan Li, Yifan Xie, Ying He, Yi Zhang, Hongwei Ren, Zhou Liu, Wei Yao, Fuji Ren, Fei Richard Yu, Shiguang Ni

TL;DR

This paper delivers the first systematic survey of human emotion synthesis based on generative models, covering facial, speech, and textual modalities. It analyzes five foundational model families—Auto‑Encoders, GANs, Diffusion Models, Large Language Models, and Seq2Seq—alongside key datasets and evaluation metrics, drawing on over 230 papers published through 2024. The review finds diffusion models now offer strong, controllable performance across modalities, while LLMs and Seq2Seq approaches drive emotionally rich textual content, and AE/GAN models remain influential in facial expression tasks. It proposes future directions including hybrid architectures, cross‑modal and cross‑domain emotion synthesis, and edge‑device real‑time applications, highlighting significant implications for interactive AI, entertainment, and affective computing. Overall, the work provides a comprehensive foundation to guide researchers and practitioners in developing more authentic and contextually appropriate emotion synthesis systems.

Abstract

Human emotion synthesis is a crucial aspect of affective computing. It involves using computational methods to mimic and convey human emotions through various modalities, with the goal of enabling more natural and effective human-computer interactions. Recent advances in generative models, such as Autoencoders, Generative Adversarial Networks, Diffusion Models, Large Language Models, and Sequence-to-Sequence Models, have contributed significantly to this field. However, comprehensive reviews of the field remain scarce. This paper addresses that gap by providing a thorough and systematic overview of recent advances in human emotion synthesis based on generative models. Specifically, the review first presents the review methodology, the emotion models involved, the mathematical principles of generative models, and the datasets used. It then covers the application of different generative models to emotion synthesis across a variety of modalities, including facial images, speech, and text, and examines mainstream evaluation metrics. Finally, the review presents major findings and suggests future research directions, providing a comprehensive understanding of the role of generative technology in the nuanced domain of emotion synthesis.

Paper Structure

This paper contains 28 sections, 10 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Schematic Diagram of Generation Technology for Human Emotion Synthesis.
  • Figure 2: Taxonomy of This Survey.
  • Figure 3: A Comprehensive Review Methodology.
  • Figure 4: Plutchik Wheel (left) and 2D Emotion Model (right).
  • Figure 5: A mask-based GAN [xue2024semantic] for face reenactment. The system comprises four main components: a Semantic Mask Generator (SMG) produces masks for specific facial regions (eyes, mouth, cheeks); an Adversarial Autoencoder (AAE) encodes these masks into latent codes; a Transformative Generator (TG) uses these codes together with target expression labels to generate new facial expressions; and an action-unit (AU) intensity Discriminator (AUD) assesses the quality and intensity of the generated expressions.
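The Figure 5 caption describes a four-stage pipeline (SMG, AAE, TG, AUD). The Python sketch below traces only the data flow between those stages, under loudly stated assumptions: every component is a toy random linear map standing in for a trained network, images are flattened 16-pixel vectors, and the expression label set, latent size, and number of AUs are hypothetical choices. It is not the implementation from xue2024semantic, and all training (adversarial losses, AU-intensity supervision) is omitted.

```python
import math
import random

REGIONS = ("eyes", "mouth", "cheeks")  # facial regions named in the Figure 5 caption
EXPRESSIONS = ("neutral", "happy", "sad", "angry", "surprised")  # hypothetical label set


def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Toy fully connected layer y = W x (no bias), replacing a real CNN."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def one_hot(label):
    """Encode the target expression label as a one-hot vector."""
    return [1.0 if e == label else 0.0 for e in EXPRESSIONS]


class SemanticMaskGenerator:
    """SMG: predicts one binary mask per facial region from a flattened image."""
    def __init__(self, pixels, rng):
        self.heads = {r: random_matrix(pixels, pixels, rng) for r in REGIONS}

    def __call__(self, image):
        return {r: [1.0 if v > 0 else 0.0 for v in linear(w, image)]
                for r, w in self.heads.items()}


class MaskEncoder:
    """Encoder half of the AAE: compresses each region mask to a latent code.
    The AAE's adversarial prior-matching loss is training-only and omitted here."""
    def __init__(self, pixels, latent_dim, rng):
        self.w = random_matrix(latent_dim, pixels, rng)

    def __call__(self, masks):
        return {r: [math.tanh(v) for v in linear(self.w, m)] for r, m in masks.items()}


class TransformativeGenerator:
    """TG: decodes the region codes plus a target-expression label into a new face."""
    def __init__(self, latent_dim, pixels, rng):
        in_dim = latent_dim * len(REGIONS) + len(EXPRESSIONS)
        self.w = random_matrix(pixels, in_dim, rng)

    def __call__(self, codes, target_label):
        z = [v for r in REGIONS for v in codes[r]] + one_hot(target_label)
        return [math.tanh(v) for v in linear(self.w, z)]  # pixel values in (-1, 1)


class AUIntensityDiscriminator:
    """AUD: scores realism and regresses action-unit (AU) intensities."""
    def __init__(self, pixels, num_aus, rng):
        self.real_w = random_matrix(1, pixels, rng)
        self.au_w = random_matrix(num_aus, pixels, rng)

    def __call__(self, image):
        realism = 1.0 / (1.0 + math.exp(-linear(self.real_w, image)[0]))  # sigmoid
        intensities = [max(0.0, v) for v in linear(self.au_w, image)]  # ReLU: non-negative
        return realism, intensities


def reenact(image, target_label, pixels=16, latent_dim=4, num_aus=6, seed=0):
    """Run SMG -> AAE encoder -> TG -> AUD on one flattened image."""
    rng = random.Random(seed)
    smg = SemanticMaskGenerator(pixels, rng)
    enc = MaskEncoder(pixels, latent_dim, rng)
    tg = TransformativeGenerator(latent_dim, pixels, rng)
    aud = AUIntensityDiscriminator(pixels, num_aus, rng)
    masks = smg(image)
    codes = enc(masks)
    fake = tg(codes, target_label)
    return masks, codes, fake, aud(fake)


if __name__ == "__main__":
    rng = random.Random(1)
    face = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # stand-in for a flattened 4x4 face crop
    masks, codes, fake, (realism, aus) = reenact(face, target_label="happy")
    print(sorted(masks), len(fake), round(realism, 3))
```

Keeping each stage as a separate callable mirrors the caption's division of labor: only the TG sees the target label, so changing the label alters the generated face while the region masks and latent codes stay fixed. In a real system each stage would be a trained network, and the AUD's two outputs would presumably supply the training signals for realism and expression intensity.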