Controllable Talking Face Generation by Implicit Facial Keypoints Editing

Dong Zhao, Jiaying Shi, Wenjun Li, Shudong Wang, Shenghui Xu, Zhaoming Pan

TL;DR

ControlTalk is a talking face generation method that controls facial expression deformation based on driving audio. It constructs the head pose and facial expression, including lip motion, for both single-image and sequential-video inputs in a unified manner.

Abstract

Audio-driven talking face generation has garnered significant interest within the domain of digital human research. Existing methods are encumbered by intricate model architectures whose components are tightly interdependent, which complicates re-editing image or video inputs. In this work, we present ControlTalk, a talking face generation method that controls facial expression deformation based on driving audio and constructs the head pose and facial expression, including lip motion, for both single-image and sequential-video inputs in a unified manner. By utilizing a pre-trained video synthesis renderer and proposing a lightweight adaptation, ControlTalk achieves precise and naturalistic lip synchronization while enabling quantitative control over mouth opening shape. Our experiments show that our method outperforms state-of-the-art methods on widely used benchmarks, including HDTF and MEAD. The parameterized adaptation demonstrates remarkable generalization capabilities, effectively handling expression deformation in same-ID and cross-ID scenarios and extending to out-of-domain portraits, regardless of language. Code is available at https://github.com/NetEase-Media/ControlTalk.

Paper Structure

This paper contains 13 sections, 6 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: An overview of ControlTalk. Our method consists of four modules, but only Audio2Exp is trained, which simplifies the whole process. During training, audio and video are used as inputs; speech features and parameterized coefficients are extracted from the audio and the video, respectively, by the pre-trained model, and are then converted into lip-synced expression coefficients by Audio2Exp. Finally, the input video frame and the parameterized coefficients, including the new expression coefficients, are rendered into the generated talking face video. At inference time, image input is also supported with driven motions.
  • Figure 2: Silent audio training for an adjustable talking mouth. Silent audio first controls the predicted expression; the final expression is then synchronized with the input audio through Audio2Exp.
  • Figure 3: The combination of two losses. Perceptual loss and lip-sync loss are applied to different areas of the image.
  • Figure 4: Detailed comparisons of different methods. The red arrow points to the mouth box of DINet.
  • Figure 5: Qualitative comparisons in the same-ID setting. The input audio and portrait share the same identity, and all dubbing videos and reference videos come from the same ID.
  • ...and 6 more figures
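
The data flow described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name below (`FaceCoefficients`, `extract_speech_features`, `extract_coefficients`, `audio2exp`, `render`, `generate`, and the `mouth_scale` parameter) is hypothetical, and each function is a toy stand-in for a module the caption only names. The sketch only shows which component feeds which, assuming Audio2Exp is the single trainable part and the other modules are frozen.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class FaceCoefficients:
    """Hypothetical container for per-frame parameterized coefficients."""
    head_pose: List[float]
    expression: List[float]


def extract_speech_features(audio: List[float], num_frames: int) -> List[List[float]]:
    """Toy stand-in for the frozen speech encoder: one sample chunk per frame."""
    hop = max(1, len(audio) // num_frames)
    return [audio[i * hop:(i + 1) * hop] for i in range(num_frames)]


def extract_coefficients(frame: str) -> FaceCoefficients:
    """Toy stand-in for the frozen coefficient extractor (neutral face)."""
    return FaceCoefficients(head_pose=[0.0, 0.0, 0.0], expression=[0.0] * 4)


def audio2exp(speech: List[float], coeffs: FaceCoefficients,
              mouth_scale: float = 1.0) -> List[float]:
    """Toy stand-in for Audio2Exp, the only module assumed trainable.

    Shifts each expression coefficient by the chunk's mean absolute
    amplitude; mouth_scale is an assumed knob for mouth opening.
    """
    energy = sum(abs(s) for s in speech) / max(1, len(speech))
    return [e + mouth_scale * energy for e in coeffs.expression]


def render(frame: str, coeffs: FaceCoefficients) -> Dict[str, object]:
    """Toy stand-in for the frozen renderer: records what would be rendered."""
    return {"frame": frame, "head_pose": coeffs.head_pose,
            "expression": coeffs.expression}


def generate(frames: List[str], audio: List[float],
             mouth_scale: float = 1.0) -> List[Dict[str, object]]:
    """Audio + frames -> features -> new expression coefficients -> render."""
    if not frames:
        return []
    features = extract_speech_features(audio, len(frames))
    outputs = []
    for frame, speech in zip(frames, features):
        coeffs = extract_coefficients(frame)
        new_expression = audio2exp(speech, coeffs, mouth_scale)
        # Only the expression is replaced; head pose is passed through.
        outputs.append(render(frame, replace(coeffs, expression=new_expression)))
    return outputs
```

Under these assumptions, a single-image input can be treated as the same frame repeated, for example `generate(["portrait.png"] * 3, audio)`, while a video input passes its own frame sequence.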