A Review of Human Emotion Synthesis Based on Generative Technology
Fei Ma, Yukan Li, Yifan Xie, Ying He, Yi Zhang, Hongwei Ren, Zhou Liu, Wei Yao, Fuji Ren, Fei Richard Yu, Shiguang Ni
TL;DR
This paper delivers the first systematic survey of human emotion synthesis based on generative models, covering facial, speech, and textual modalities. It analyzes five foundational model families (Autoencoders, GANs, Diffusion Models, Large Language Models, and Seq2Seq) alongside key datasets and evaluation metrics, drawing on over 230 papers published through 2024. The review finds diffusion models now offer strong, controllable performance across modalities, while LLMs and Seq2Seq approaches drive emotionally rich textual content, and AE/GAN models remain influential in facial expression tasks. It proposes future directions including hybrid architectures, cross‑modal and cross‑domain emotion synthesis, and edge‑device real‑time applications, highlighting significant implications for interactive AI, entertainment, and affective computing. Overall, the work provides a comprehensive foundation to guide researchers and practitioners in developing more authentic and contextually appropriate emotion synthesis systems.
Abstract
Human emotion synthesis is a crucial aspect of affective computing. It involves using computational methods to mimic and convey human emotions through various modalities, with the goal of enabling more natural and effective human-computer interactions. Recent advancements in generative models, such as Autoencoders, Generative Adversarial Networks, Diffusion Models, Large Language Models, and Sequence-to-Sequence Models, have significantly contributed to the development of this field. However, comprehensive reviews of the area are notably lacking. To address this gap, this paper provides a thorough and systematic overview of recent advancements in human emotion synthesis based on generative models. Specifically, the review first presents the review methodology, the emotion models involved, the mathematical principles of generative models, and the datasets used. It then covers the application of different generative models to emotion synthesis across a variety of modalities, including facial images, speech, and text, and examines mainstream evaluation metrics. Finally, the review presents major findings and suggests future research directions, providing a comprehensive understanding of the role of generative technology in the nuanced domain of emotion synthesis.
