A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing
Ming Meng, Yufei Zhao, Bo Zhang, Yonggui Zhu, Weimin Shi, Maxwell Wen, Zhaoxin Fan
TL;DR
This survey analyzes talking head synthesis through three interconnected domains: portrait generation, driving mechanisms, and editing. It reviews unconditional and conditional portrait generation (GANs/VAEs, autoregressive models, NeRFs, and text/label-guided methods), discusses video-driven and audio-driven mechanisms (including traditional, learning-based, and diffusion-based approaches), and details 2D/3D editing techniques with disentanglement and multimodal integration. The paper compiles benchmarking datasets and metrics to evaluate identity preservation, visual quality, lip-sync accuracy, and motion realism, and highlights datasets such as VoxCeleb1/2, LRW, CREMA-D, and TalkingHead-1KH. Key contributions include organizing the field into a coherent framework, comparing state-of-the-art methods, and outlining future directions to address data bias, temporal consistency, large-angle driving, and computational efficiency, with applications spanning film, gaming, and virtual communication.
Abstract
Talking head synthesis, an advanced method for generating portrait videos from a still image driven by specific content, has garnered widespread attention in virtual reality, augmented reality, and game production. Recently, significant breakthroughs have been made with the introduction of novel models such as the transformer and the diffusion model. Current methods can not only generate new content but also edit the generated material. This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driving mechanisms, and editing techniques. We summarize milestone studies and critically analyze their innovations and shortcomings within each domain. Additionally, we organize an extensive collection of datasets and provide a thorough performance analysis of current methodologies based on various evaluation metrics, aiming to furnish a clear framework and robust data support for future research. Finally, we explore application scenarios of talking head synthesis, illustrate them with specific cases, and examine potential future directions.
