PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation

Baiqin Wang, Xiangyu Zhu, Fan Shen, Hao Xu, Zhen Lei

TL;DR

PC-Talk tackles the lack of controllability in audio-driven talking-face generation by proposing implicit keypoint deformations as an intermediate representation and introducing two dedicated modules: Lip-audio Alignment Control (LAC) for precise lip-sync and speaking-style editing, and EMotion Control (EMC) for fine-grained emotion synthesis. The method enables word-level style edits, lip-movement scaling, and region-wise emotion composition by disentangling pure emotional deformation from lip-sync, using multiple emotional sources. Evaluations on HDTF and MEAD show state-of-the-art lip synchronization, image quality, temporal consistency, and emotional expressiveness, with real-time performance at 30 FPS. This work advances practical, controllable digital humans by providing fine-grained, multi-source emotion control and robust lip-sync, enabling customizable, realistic talking-face videos.

Abstract

Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over facial animation such as speaking style and emotional expression, resulting in uniform outputs. In this paper, we focus on improving two key factors: lip-audio alignment and emotion control, to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control focuses on elements like speaking style and the scale of lip movements, whereas emotion control is centered on generating realistic emotional expressions, allowing for modifications in multiple attributes such as intensity. To achieve precise control of facial animation, we propose a novel framework, PC-Talk, which enables lip-audio alignment and emotion control through implicit keypoint deformations. First, our lip-audio alignment control module facilitates precise editing of speaking styles at the word level and adjusts lip movement scales to simulate varying vocal loudness levels, maintaining lip synchronization with the audio. Second, our emotion control module generates vivid emotional facial features with pure emotional deformation. This module also enables the fine modification of intensity and the combination of multiple emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves state-of-the-art performance on both HDTF and MEAD datasets in extensive experiments.

Paper Structure

This paper contains 24 sections, 16 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Facial animation control proposed in PC-Talk, divided into two categories: lip-audio alignment control and emotion control. The lip-audio alignment control module controls and edits talking style, and also supports modifying the scale of lip movement. The emotion control module generates emotional talking faces with different intensities from multiple sources, and can also compose complex emotional expressions with a different emotion in each facial region.
  • Figure 2: Our framework PC-Talk is designed for precise facial animation control in talking face generation. It achieves this control by first predicting a deformation of implicit keypoints and then rendering it into a final talking image. We utilize a Lip-audio Alignment Control (LAC) module to estimate lip-sync deformations $D_l$ and an EMotion Control (EMC) module to estimate emotional deformations $D_e$.
  • Figure 3: Data augmentation using video-driven portrait animation from the same framework.
  • Figure 4: Comparison with other baselines. We highlight flaws of other methods using colorful bounding boxes, including blurry teeth, inaccurate lip shapes, and incorrect emotional expressions.
  • Figure 5: Ablation study on the subtraction operation.
  • ...and 4 more figures
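
The controls named in the abstract (lip-movement scale, emotion intensity, and a different emotion per facial region) can be illustrated with a small sketch over implicit keypoints. This is only an illustration under stated assumptions: it assumes the lip-sync deformation ($D_l$ in the Figure 2 caption) and the emotional deformation ($D_e$) are added to canonical keypoints as weighted offsets, which this summary does not confirm. The function `compose_keypoints`, the region labels, and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Vec = List[float]


def compose_keypoints(
    canonical: List[Vec],
    d_lip: List[Vec],
    d_emo: Dict[str, List[Vec]],
    region_of: List[str],
    emotion_by_region: Dict[str, Tuple[str, float]],
    lip_scale: float = 1.0,
) -> List[Vec]:
    """Hypothetical additive composition (an assumption, not the paper's
    confirmed formula): canonical + lip_scale * D_l + intensity * D_e."""
    driven = []
    for i, point in enumerate(canonical):
        # Each keypoint takes the emotion assigned to its facial region;
        # regions without an assigned emotion get no emotional offset.
        emotion, intensity = emotion_by_region.get(region_of[i], ("", 0.0))
        emo_offset = d_emo[emotion][i] if emotion else [0.0] * len(point)
        driven.append([
            p + lip_scale * dl + intensity * de
            for p, dl, de in zip(point, d_lip[i], emo_offset)
        ])
    return driven


# Toy data: two 2D keypoints, one in each (hypothetical) facial region.
canonical = [[0.0, 0.0], [1.0, 1.0]]
d_lip = [[0.0, 0.0], [0.0, 0.2]]
d_emo = {
    "happy": [[0.1, 0.0], [0.0, 0.1]],
    "sad": [[-0.1, 0.0], [0.0, -0.1]],
}
region_of = ["upper", "lower"]

# Sad upper face at full intensity, mildly happy lower face, and lip
# motion doubled to mimic louder speech.
result = compose_keypoints(
    canonical, d_lip, d_emo, region_of,
    emotion_by_region={"upper": ("sad", 1.0), "lower": ("happy", 0.5)},
    lip_scale=2.0,
)
print(result)
```

Keeping the two deformations as separate terms is what makes each control independent in this sketch: changing `lip_scale` leaves the emotional offsets untouched, and changing a region's emotion or intensity leaves the lip-sync term untouched.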