Say Anything with Any Style

Shuai Tan, Bin Ji, Yu Ding, Ye Pan

TL;DR

This work develops a multi-task VQ-VAE that incorporates three closely related tasks to learn a style codebook as a prior for style extraction, which improves the precision and robustness of extracting speaking styles from the given style clips.

Abstract

Generating stylized talking heads with diverse head motions is crucial for achieving natural-looking videos but remains challenging. Previous works either adopt a regressive method to capture the speaking style, resulting in a coarse style that is averaged across all training data, or employ a universal network to synthesize videos with different styles, which causes suboptimal performance. To address these issues, we propose a novel dynamic-weight method, namely Say Anything with Any Style (SAAS), which queries the discrete style representation via a generative model with a learned style codebook. Specifically, we develop a multi-task VQ-VAE that incorporates three closely related tasks to learn a style codebook as a prior for style extraction. This discrete prior, along with the generative model, enhances the precision and robustness of extracting the speaking styles from the given style clips. Using the extracted style, a residual architecture comprising a canonical branch and a style-specific branch predicts the mouth shapes conditioned on any driving audio while transferring the speaking style from the source to any desired one. To adapt to different speaking styles, we avoid a universal network and instead design a HyperStyle module that produces style-specific weight offsets for the style branch. Furthermore, we construct a pose generator and a pose codebook to store the quantized pose representation, allowing us to sample diverse head motions aligned with the audio and the extracted style. Experiments demonstrate that our approach surpasses state-of-the-art methods in terms of both lip synchronization and stylized expression. In addition, we extend SAAS to video-driven style editing and achieve satisfactory performance.
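
To make the idea of querying a discrete codebook concrete, the following is a minimal sketch of the generic vector-quantization lookup used in VQ-VAE-style models. It is not the SAAS implementation: the function name `quantize`, the codebook size, and the toy values are assumptions chosen purely for illustration.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z (N, D) to its nearest codebook entry (K, D)."""
    # Squared Euclidean distance between every feature and every code: (N, K)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)          # index of the closest code per feature
    return codebook[idx], idx

# Hypothetical toy codebook: 8 codes of dimension 4, code k = (k, k, k, k)
codebook = np.repeat(np.arange(8.0)[:, None], 4, axis=1)

# Two continuous features that lie close to codes 2 and 5
z = codebook[[2, 5]] + np.array([0.1, -0.2, 0.05, 0.0])

z_q, idx = quantize(z, codebook)
print(idx)  # [2 5]
```

Because the output is always one of a finite set of learned entries, a noisy or partial observation is snapped to the nearest stored code, which is the general intuition behind using a discrete prior for robustness.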

Paper Structure (18 sections, 7 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Example animations generated by our SAAS. Given a source image and a style reference clip, SAAS generates stylized talking faces driven by audio. The lip motions are synchronized with the audio, while the speaking styles are controlled by the style clips. We also support video-driven style editing by inputting a source video.
  • Figure 2: (a) The overview of SAAS. We first extract expression coefficients $\beta^s_{1:T'}$ from the style reference video $V^s$ with a 3DMM and extract the style code $s$. The Audio Encoder $E_a$ encodes the coefficient $\beta^r$ of the source image and the driving audio $a_{1:T}$ into $z^a_{1:T}$, which is fed into the canonical branch $\phi_c$ and the style-specific branch $\phi_s$. To generate stylized motion, $\phi_s$ accepts the style-specific weights produced by HyperStyle $H$ and transforms $z^a_{1:T}$ into the stylized $z^s_{1:T}$. The Decoder $D$ reconstructs the coefficients $\hat{\beta}^s_{1:T}$, and the Face Render $R$ synthesizes the stylized video $\hat{V}^s$ together with the head pose $\hat{p}_{1:T}$ predicted by the proposed Pose Generator $G_p$. (b) The pipeline of Style Extraction. The dotted arrows indicate the processes in the training phase of $C_s$.
  • Figure 3: The pipeline of pose generator $G_p$.
  • Figure 4: Qualitative comparisons with state-of-the-art methods. The top row shows the identity, the driving audio, and the corresponding mouth ground truth. The purple row shows the style source clips.
  • Figure 5: Visualization results of ablation study.
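
The dynamic-weight idea described in the abstract and in Figure 2(a) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' architecture: each branch is reduced to a single linear layer, the stand-in hypernetwork `H` is a single linear map, the two branch outputs are assumed to be summed, and all names and sizes (`stylize`, `W_c`, `W_s`, `D`, `S`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, S = 6, 3                      # hypothetical latent and style-code sizes

W_c = rng.normal(size=(D, D))    # canonical branch weights (shared by all styles)
W_s = rng.normal(size=(D, D))    # base weights of the style-specific branch
H = rng.normal(size=(S, D * D))  # stand-in hypernetwork: one linear map

def stylize(z_a, s):
    """z_a: (T, D) audio latents, s: (S,) style code -> (T, D) stylized latents."""
    delta_W = (s @ H).reshape(D, D)   # style-dependent weight offset
    canonical = z_a @ W_c             # style-agnostic path
    styled = z_a @ (W_s + delta_W)    # style path with offset weights
    return canonical + styled         # assumed residual combination

z_a = rng.normal(size=(5, D))                        # 5 frames of audio latents
z_neutral = stylize(z_a, np.zeros(S))                # zero style code: no offset
z_styled = stylize(z_a, np.array([1.0, 0.0, 0.0]))   # a different style code
```

With a zero style code the offset vanishes and the output depends only on the fixed weights, whereas a non-zero code changes the weights themselves rather than merely being concatenated to the input. That is the distinction between a dynamic-weight design and a single universal network.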