SPEAK: Speech-Driven Pose and Emotion-Adjustable Talking Head Generation

Changpeng Cai, Guinan Guo, Jiao Li, Junhao Su, Fei Shen, Chenghao He, Jing Xiao, Yuanxu Chen, Lei Dai, Feiyu Zhu

TL;DR

SPEAK is a novel one-shot Talking Head Generation framework that differs from general Talking Face Generation by enabling emotional and postural control. Its generator uses modified latent codes from an editing module to regulate emotional expression, head pose, and speech content when synthesizing facial animations.

Abstract

Most earlier research on talking face generation has focused on synchronizing lip motion with speech content. However, head pose and facial emotions are equally important characteristics of natural faces. While audio-driven talking face generation has seen notable advancements, existing methods either overlook facial emotions or are limited to specific individuals and cannot be applied to arbitrary subjects. In this paper, we propose a novel one-shot Talking Head Generation framework (SPEAK) that distinguishes itself from general Talking Face Generation by enabling emotional and postural control. Specifically, we introduce an Inter-Reconstructed Feature Disentanglement (IRFD) module to decouple facial features into three latent spaces. We then design a face editing module that modifies speech content and facial latent codes and combines them in a single latent space. Subsequently, we present a novel generator that employs the modified latent codes derived from the editing module to regulate emotional expression, head poses, and speech content in synthesizing facial animations. Extensive experiments demonstrate that our method ensures lip synchronization with the audio while enabling decoupled control of facial features. It can generate realistic talking heads with coordinated lip motions, authentic facial emotions, and smooth head movements. The demo video is available at https://anonymous.4open.science/r/SPEAK-8A22
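
The three stages named in the abstract (feature disentanglement, editing, generation) can be pictured as a simple data flow. The Python sketch below is only an illustration under loud assumptions: `toy_encode`, `disentangle`, `edit`, and `generate` are hypothetical placeholders, the encoders are not learned, and concatenation stands in for the editing module, whose actual operation the abstract does not specify. It shows only which inputs feed which stage, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class FacialLatents:
    """The three latent codes the abstract attributes to the IRFD module."""
    identity: Vector
    pose: Vector
    emotion: Vector


def toy_encode(values: Vector, dim: int) -> Vector:
    """Hypothetical stand-in for a learned encoder: fold values into `dim` bins and average each bin."""
    sums = [0.0] * dim
    counts = [0] * dim
    for i, v in enumerate(values):
        sums[i % dim] += v
        counts[i % dim] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def disentangle(identity_img: Vector, pose_frame: Vector,
                emotion_frame: Vector, dim: int = 4) -> FacialLatents:
    """Placeholder for IRFD: one separate code per facial factor."""
    return FacialLatents(
        identity=toy_encode(identity_img, dim),
        pose=toy_encode(pose_frame, dim),
        emotion=toy_encode(emotion_frame, dim),
    )


def edit(latents: FacialLatents, audio_feat: Vector) -> Vector:
    """Placeholder editing step (assumption: plain concatenation into one code)."""
    return latents.identity + latents.pose + latents.emotion + audio_feat


def generate(code: Vector, frame_size: int = 8) -> Vector:
    """Placeholder generator: maps the single latent code to a fixed-size 'frame'."""
    return toy_encode(code, frame_size)


if __name__ == "__main__":
    latents = disentangle([0.1, 0.2, 0.3, 0.4], [1.0, 2.0, 3.0, 4.0], [0.5] * 4)
    frame = generate(edit(latents, audio_feat=[0.0] * 4))
    print(len(frame))
```

Because each factor has its own code, swapping only the pose or only the emotion input changes a single slice of the combined code, which is the kind of decoupled control the abstract describes.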

Paper Structure (6 sections, 5 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Illustration of our proposed Talking Head Generation framework. Our framework first extracts the human face. We then employ the IRFD to decouple facial features from the video clips $I$, $P_{(1:i)}$, $Em_{(1:i)}$ into three latent spaces $f_{I_{i}}$, $f_{P_{i}}$, $f_{E_{i}}$. An audio encoder encodes the speech waveform into audio content features $f_{a_{(1:i)}}$. The editing module then aligns the audio content $f_{a_{(1:i)}}$ and facial information $f_{R_{(1:i)}}^{'}$ modalities.
  • Figure 2: Qualitative comparisons with other baselines. The top two rows show the Identity, Reference Source (video frames after the fusion of emotion and pose), and Audio. Since SPEAK is the first method to generate videos from four types of input data, no prior ground truth with all four inputs exists. We combine the Audio and the Reference Source to generate the mouth shape used as Mouth GT.
  • Figure 3: Visualized results of SPEAK ablation study.