Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

Ran Yi, Zipeng Ye, Juyong Zhang, Hujun Bao, Yong-Jin Liu

TL;DR

This work addresses realistic talking-face video synthesis with personalized head pose driven by an audio signal. It introduces a two-stage framework: Stage 1 learns an audio-to-expression/pose mapping and builds a personalized 3D facial animation via fine-tuning on a short target video; Stage 2 renders these animations and refines frames with a memory-augmented GAN that stores identity-feature memory for cross-subject generalization and applies background matching. The approach achieves high-quality lip synchronization and natural head movements for arbitrary source and target identities, significantly outperforming state-of-the-art 2D methods and enabling practical use with only about 300 frames for personalization. Extensive experiments and user studies validate the effectiveness of the personalized head-pose modeling and the frame-refinement network, illustrating the method's potential for realistic, controllable talking-face generation.

Abstract

Real-world talking faces are often accompanied by natural head movement. However, most existing talking face video generation methods only consider facial animation with a fixed head pose. In this paper, we address this problem by proposing a deep neural network model that takes an audio signal A of a source person and a very short video V of a target person as input, and outputs a synthesized high-quality talking face video with personalized head pose (making use of the visual information in V), expression and lip synchronization (by considering both A and V). The most challenging issue in our work is that natural poses often cause in-plane and out-of-plane head rotations, which make synthesized talking face videos far from realistic. To address this challenge, we reconstruct a 3D face animation and re-render it into synthesized frames. To fine-tune these frames into realistic ones with smooth background transitions, we propose a novel memory-augmented GAN module. By first training a general mapping on a publicly available dataset and then fine-tuning the mapping using the input short video of the target person, we develop an effective strategy that requires only a small number of frames (about 300) to learn personalized talking behavior, including head pose. Extensive experiments and two user studies show that our method can generate high-quality talking face videos (i.e., with personalized head movements, expressions and good lip synchronization) that look natural and have more distinctive head movement effects than those of state-of-the-art methods.

Paper Structure

This paper contains 22 sections, 6 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Flowchart of our method. (Stage 1) We train a general mapping from the input audio to the facial expression and common head pose. Then, we reconstruct the 3D face and fine-tune the general mapping to learn personalized talking behavior from the input video. This yields a 3D facial animation with personalized head pose. (Stage 2) We render the 3D facial animation into video frames using the texture and lighting information obtained from the input video. Then we fine-tune these synthesized frames into realistic frames using a novel memory-augmented GAN module.
  • Figure 2: Our memory-augmented GAN for refining rendered frames into realistic frames. The generator takes a window of rendered frames and an identity feature as input, and generates a refined frame using an attention mechanism. The discriminator judges whether a frame is real or not. The memory network is introduced to remember representative identities during training and to retrieve the best-match identity feature at test time. During training, the memory network is updated with paired spatial features and ground-truth identity features. At test time, the memory network retrieves the best-match identity feature using the spatial feature as the query.
  • Figure 3: Comparison of real videos with natural head pose and our generated talking face videos with personalized behavior. Our method can achieve both good lip synchronization and personalized head pose.
  • Figure 4: Our method works well for people of different races and ages.
  • Figure 5: Ablation study. The first row shows the ground truth video (a segment from a YouTube video). The second row shows the results generated without pose estimation in the first stage. The third row shows the results generated when the identity feature is excluded from the GAN input and the memory network is removed from the GAN model. The last row shows the results of our full model.
  • ...and 4 more figures
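
The memory behavior described in the Figure 2 caption (store paired spatial and identity features during training, then retrieve the best-match identity feature with a spatial feature as the query) can be illustrated with a small key-value memory. This is only a minimal sketch under stated assumptions, not the paper's implementation: the class name `IdentityMemory`, the cosine-similarity matching, the fixed capacity, and the overwrite-the-oldest-slot policy are all hypothetical choices, since the captions do not specify the similarity measure or the update rule.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class IdentityMemory:
    """Hypothetical key-value memory: spatial feature -> identity feature."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = []    # spatial features (used as queries/keys)
        self.values = []  # paired ground-truth identity features
        self.ages = []    # number of updates since each slot was written

    def update(self, spatial_feature, identity_feature):
        """Training-time write of one (spatial, identity) pair.

        Assumed policy: append while there is room, otherwise overwrite
        the slot that was written longest ago.
        """
        self.ages = [age + 1 for age in self.ages]
        if len(self.keys) < self.capacity:
            self.keys.append(list(spatial_feature))
            self.values.append(list(identity_feature))
            self.ages.append(0)
        else:
            oldest = max(range(len(self.ages)), key=self.ages.__getitem__)
            self.keys[oldest] = list(spatial_feature)
            self.values[oldest] = list(identity_feature)
            self.ages[oldest] = 0

    def retrieve(self, query):
        """Test-time read: return the identity feature whose stored
        spatial key is most similar to the query."""
        if not self.keys:
            raise LookupError("memory is empty")
        best = max(range(len(self.keys)),
                   key=lambda i: cosine(query, self.keys[i]))
        return self.values[best]


# Usage: write two identities, then query with an unseen spatial feature.
mem = IdentityMemory(capacity=2)
mem.update([1.0, 0.0], [0.9, 0.1])
mem.update([0.0, 1.0], [0.2, 0.8])
nearest = mem.retrieve([0.9, 0.1])  # closest stored key is [1.0, 0.0]
```

In a trained model the features would be learned network activations rather than hand-written lists. The sketch only shows why a query from an unseen subject can still return a usable identity feature: the lookup returns the nearest stored identity rather than requiring an exact match.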