Controllable Talking Face Generation by Implicit Facial Keypoints Editing

Dong Zhao; Jiaying Shi; Wenjun Li; Shudong Wang; Shenghui Xu; Zhaoming Pan

TL;DR

ControlTalk is a talking face generation method that controls facial expression deformation from driving audio. It constructs head pose and facial expression, including lip motion, for both single-image and sequential-video inputs in a unified manner.

Abstract

Audio-driven talking face generation has garnered significant interest within the domain of digital human research. Existing methods rely on intricate model architectures whose components depend closely on one another, which complicates re-editing of image or video inputs. In this work, we present ControlTalk, a talking face generation method that controls facial expression deformation from driving audio and constructs head pose and facial expression, including lip motion, for both single-image and sequential-video inputs in a unified manner. By using a pre-trained video synthesis renderer and proposing a lightweight adaptation, ControlTalk achieves precise and naturalistic lip synchronization while enabling quantitative control over the mouth opening shape. Our experiments show that our method outperforms state-of-the-art methods on widely used benchmarks, including HDTF and MEAD. The parameterized adaptation generalizes well, handling expression deformation across same-ID and cross-ID scenarios and extending to out-of-domain portraits, regardless of language. Code is available at https://github.com/NetEase-Media/ControlTalk.

Paper Structure (13 sections, 6 equations, 11 figures, 1 table)

Introduction
Related Work
Method
ControlTalk
Audio2Exp
Adjustable Talking Mouth
Losses
Experiments
Experimental Setup
Audio-driven Talking Face Generation
Ablation Study
Generalization
Conclusion and Discussion

Figures (11)

Figure 1: An overview of ControlTalk. Our method consists of 4 modules, but only Audio2Exp is trained, which simplifies the whole process. During training, audio and video are the inputs; pre-trained models extract speech features from the audio and parameterized coefficients from the video, and Audio2Exp then converts them into lip-synced expression coefficients. Finally, the input video frame and the parameterized coefficients, including the new expression coefficients, are rendered into the generated talking face video. In the inference phase, image input with driving motions is also supported.
Figure 2: Silent audio training for the adjustable talking mouth. Silent audio first controls the predicted expression, and the final expression is then synchronized with the input audio through Audio2Exp.
Figure 3: The combination of two losses. Perceptual loss and lip-sync loss are applied to different areas of the image.
Figure 4: Detailed comparisons of different methods. The red arrow points to the mouth box produced by DINet.
Figure 5: Qualitative comparisons in the same-ID setting. The input audio and portrait share the same identity, and all dubbing videos and reference videos come from the same ID.
...and 6 more figures
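
The Figure 3 caption says only that two losses are applied to different areas of the image. A minimal sketch of that idea, under assumptions not taken from the paper, is shown here: the image is split by a hypothetical rectangular mouth box, one loss is evaluated inside the box and another on the remainder, and the two terms are summed with hypothetical weights. The placeholder `mean_abs_diff` stands in for both losses; it is neither a perceptual loss nor a lip-sync loss, and which loss the authors assign to which region is not stated in the caption.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive
LossFn = Callable[[Image, Image], float]


def crop(img: Image, box: Box) -> Image:
    """Return the sub-image inside the box."""
    top, left, bottom, right = box
    return [row[left:right] for row in img[top:bottom]]


def mask_out(img: Image, box: Box) -> Image:
    """Return a copy of the image with pixels inside the box set to zero."""
    top, left, bottom, right = box
    return [
        [0.0 if top <= r < bottom and left <= c < right else v
         for c, v in enumerate(row)]
        for r, row in enumerate(img)
    ]


def mean_abs_diff(a: Image, b: Image) -> float:
    """Placeholder loss (mean absolute difference); a stand-in only."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def combined_loss(
    pred: Image,
    target: Image,
    mouth_box: Box,                     # hypothetical region split
    mouth_loss: LossFn = mean_abs_diff,
    rest_loss: LossFn = mean_abs_diff,
    mouth_weight: float = 1.0,          # hypothetical weights
    rest_weight: float = 1.0,
) -> float:
    """Evaluate one loss inside the mouth box and another outside it."""
    mouth_term = mouth_loss(crop(pred, mouth_box), crop(target, mouth_box))
    rest_term = rest_loss(mask_out(pred, mouth_box), mask_out(target, mouth_box))
    return mouth_weight * mouth_term + rest_weight * rest_term
```

In a real training setup the two callables would be replaced by network-based losses; passing them as arguments keeps the region split independent of the choice of loss.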
