
VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer

Liyang Chen, Zhiyong Wu, Runnan Li, Weihong Bao, Jun Ling, Xu Tan, Sheng Zhao

TL;DR

VAST is an unsupervised variational style transfer model that vivifies neutral photo-realistic avatars by flexibly capturing the expressive facial style from arbitrary video prompts and transferring it onto a personalized image renderer in a zero-shot manner.

Abstract

Current talking face generation methods mainly focus on speech-lip synchronization. However, insufficient investigation of facial talking style leads to lifeless and monotonous avatars. Most previous works fail to imitate expressive styles from arbitrary video prompts and to ensure the authenticity of the generated video. This paper proposes an unsupervised variational style transfer model (VAST) to vivify neutral photo-realistic avatars. Our model consists of three key components: a style encoder that extracts facial style representations from the given video prompts; a hybrid facial expression decoder that models accurate speech-related movements; and a variational style enhancer that makes the style space highly expressive and meaningful. With these designs for facial style learning, our model can flexibly capture the expressive facial style from arbitrary video prompts and transfer it onto a personalized image renderer in a zero-shot manner. Experimental results demonstrate that the proposed approach yields a more vivid talking avatar with higher authenticity and richer expressiveness.
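The data flow described in the abstract (a style vector taken from a video prompt, then used to condition speech-driven expression prediction) can be pictured with a toy sketch. This is only an illustration under strong assumptions, not the authors' architecture: the function names `encode_style`, `decode_expression`, and `transfer_style`, the mean-pooling encoder, the additive decoder, and the `style_weight` parameter are all hypothetical, and the variational style enhancer is omitted entirely.

```python
from typing import List

Vector = List[float]


def encode_style(prompt_frames: List[Vector]) -> Vector:
    """Hypothetical style encoder: mean-pool per-frame expression
    parameters of the video prompt into a single style vector."""
    if not prompt_frames:
        raise ValueError("video prompt must contain at least one frame")
    dim = len(prompt_frames[0])
    count = len(prompt_frames)
    return [sum(frame[i] for frame in prompt_frames) / count for i in range(dim)]


def decode_expression(
    speech_frames: List[Vector], style: Vector, style_weight: float = 0.5
) -> List[Vector]:
    """Hypothetical decoder: offset each speech-driven expression frame
    by the weighted style vector (a stand-in for a learned network)."""
    return [
        [s + style_weight * z for s, z in zip(frame, style)]
        for frame in speech_frames
    ]


def transfer_style(
    prompt_frames: List[Vector],
    speech_frames: List[Vector],
    style_weight: float = 0.5,
) -> List[Vector]:
    """Zero-shot use: the prompt may come from any video, since the
    style vector is computed at inference time without retraining."""
    style = encode_style(prompt_frames)
    return decode_expression(speech_frames, style, style_weight)


if __name__ == "__main__":
    prompt = [[1.0, 0.0], [3.0, 2.0]]   # made-up expression parameters
    speech = [[0.0, 0.0], [1.0, 1.0]]   # made-up speech-driven predictions
    print(transfer_style(prompt, speech))
```

In the paper's setting the encoder and decoder are learned networks; the sketch only shows why swapping the prompt changes the output without touching the speech input.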

Paper Structure (20 sections, 12 equations, 9 figures, 3 tables)


Figures (9)

  • Figure 1: Concept diagram for the proposed method. An expressive video prompt is employed to amend the expression prediction for vivid avatar generation.
  • Figure 2: Overview of the proposed method. $\mathbf{\Phi}$ denotes the encoding process of the parametric face model [moai_2021], which extracts the facial parameters. $\mathbf{\Phi}^{\prime}$ denotes the decoding process that outputs a mesh from the facial parameters. In training, the video prompt and speech are paired. In inference, the video prompt can come from any source.
  • Figure 3: Qualitative comparison results on the GRID dataset. ATVG [atvg] produces blurry images. MakeItTalk [makeittalk_2020] and Wav2Lip [wav2lip_2020] generate wrong lip movements in some cases. The video prompt that provides the style for our method is randomly selected from other videos of the avatars to be synthesized.
  • Figure 4: Qualitative comparison results on the Obama and HDTF datasets. The video prompts are sampled from the Ted-HD and HDTF datasets, with styles such as exciting-lecture (left, female) and talk-show (right, male). Our results follow the facial style of the video prompts more closely, showing a bigger mouth opening at vowels (e.g. /ei/ in make), a tighter mouth closure at consonants (e.g. /m/ in them), and even lip biting (e.g. /v/ in very).
  • Figure 5: Qualitative result for the ablation study on the HDTF dataset. The avatar is speaking /ei/ of pray.
  • ...and 4 more figures