A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing

Ming Meng, Yufei Zhao, Bo Zhang, Yonggui Zhu, Weimin Shi, Maxwell Wen, Zhaoxin Fan

TL;DR

This survey analyzes talking head synthesis through three interconnected domains: portrait generation, driving mechanisms, and editing. It reviews unconditional and conditional portrait generation (GANs/VAEs, autoregressive models, NeRFs, and text/label-guided methods), discusses video-driven and audio-driven mechanisms (including traditional, learning-based, and diffusion-based approaches), and details 2D/3D editing techniques with disentanglement and multimodal integration. The paper compiles benchmarking datasets and metrics to evaluate identity preservation, visual quality, lip-sync accuracy, and motion realism, and highlights datasets such as VoxCeleb1/2, LRW, CREMA-D, and TalkingHead-1KH. Key contributions include organizing the field into a coherent framework, comparing state-of-the-art methods, and outlining future directions to address data bias, temporal consistency, large-angle driving, and computational efficiency, with applications spanning film, gaming, and virtual communication.

Abstract

Talking head synthesis, an advanced method for generating portrait videos from a still image driven by specific content, has garnered widespread attention in virtual reality, augmented reality, and game production. Recently, significant breakthroughs have been made with the introduction of novel models such as the transformer and the diffusion model. Current methods can not only generate new content but also edit the generated material. This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driving mechanisms, and editing techniques. We summarize milestone studies and critically analyze their innovations and shortcomings within each domain. Additionally, we organize an extensive collection of datasets and provide a thorough performance analysis of current methodologies based on various evaluation metrics, aiming to furnish a clear framework and robust data support for future research. Finally, we explore application scenarios of talking head synthesis, illustrate them with specific cases, and examine potential future directions.

Paper Structure

This paper contains 37 sections, 7 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Chronological overview of the three primary stages of talking head synthesis. From bottom to top, the stages are portrait generation, driving mechanisms, and editing techniques. Pentagrams mark the more influential methods in each stage.
  • Figure 2: An overview of this survey.
  • Figure 3: The comprehensive process of talking head synthesis involves three key stages. First, portrait generation creates a static image from random noise, optionally incorporating specified attributes. Second, the driving mechanism animates this image using video frames or audio content through the extraction and application of intermediate representations like keypoints and landmarks. Third, the editing technique refines the animated output, enhancing visual coherence and quality by precisely adjusting pose and expression parameters.
  • Figure 4: Typical structures of cutting-edge deep generative models in the audio-driven branch. (a) GAN-based audiovisual synthesis methods. (b) VAE-based audiovisual mapping methods. (c) Transformer-based sequence translation. (d) Implicit field animation modeling. (e) Diffusion-based expression synthesis methods.