StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads

Suzhen Wang, Yifeng Ma, Yu Ding, Zhipeng Hu, Changjie Fan, Tangjie Lv, Zhidong Deng, Xin Yu

TL;DR

StyleTalk++ tackles the challenge of one-shot talking-head generation with personalized speaking styles by learning a universal spatio-temporal style space from reference videos and injecting it into audio-driven 3DMM parameters. The framework projects styles into two branches—stylized facial expressions and stylized head poses—via a universal style encoder and style-aware decoders, then renders photorealistic results with an image renderer. Its key contributions include a triplet-based style space, adaptive style-conditioned transformers for expressions, and Transformer-XL-based recurrence for natural head motion, enabling diverse, precise lip-sync and style fidelity across unseen speakers. The approach demonstrates strong quantitative and qualitative gains over state-of-the-art baselines on multiple datasets, with meaningful style interpolation and robust performance in realistic scenarios.
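To make the "triplet-based style space" and "style interpolation" ideas above more concrete, here is a minimal sketch rather than the paper's implementation. It assumes style codes are plain vectors, uses the standard triplet margin loss with Euclidean distance, and blends two codes linearly. The function names, the margin value, and the idea that positives come from clips of the same style are all illustrative assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two style codes of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = 1.0) -> float:
    """Standard triplet margin loss (assumed form, not taken from the paper).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def interpolate_styles(s_a: Vector, s_b: Vector, alpha: float) -> List[float]:
    """Linear blend of two style codes: alpha=0 gives s_a, alpha=1 gives s_b."""
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(s_a, s_b)]
```

In this sketch, a blended code from `interpolate_styles` would simply be passed to a decoder in place of a single reference style code.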

Abstract

Individuals have unique facial expression and head pose styles that reflect their personalized speaking styles. Existing one-shot talking head methods cannot capture such personalized characteristics and therefore fail to produce diverse speaking styles in the final videos. To address this challenge, we propose a one-shot style-controllable talking face generation method that can obtain speaking styles from reference speaking videos and drive the one-shot portrait to speak with the reference speaking styles and another piece of audio. Our method aims to synthesize the style-controllable coefficients of a 3D Morphable Model (3DMM), including facial expressions and head movements, in a unified framework. Specifically, the proposed framework first leverages a style encoder to extract the desired speaking styles from the reference videos and transform them into style codes. Then, the framework uses a style-aware decoder to synthesize the coefficients of 3DMM from the audio input and style codes. During decoding, our framework adopts a two-branch architecture, which generates the stylized facial expression coefficients and stylized head movement coefficients, respectively. After obtaining the coefficients of 3DMM, an image renderer renders the expression coefficients into a specific person's talking-head video. Extensive experiments demonstrate that our method generates visually authentic talking head videos with diverse speaking styles from only one portrait image and an audio clip.
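The data flow described in the abstract can be sketched as follows. This is an illustrative skeleton only: every function is a hypothetical stand-in that uses simple arithmetic in place of the learned networks, and the feature dimensions are made up. It shows only how two style codes and the audio-derived features are routed through two branches and assembled for a renderer.

```python
from dataclasses import dataclass
from typing import List

Frame = List[float]  # one time step of features or coefficients


@dataclass
class Coefficients:
    expression: List[Frame]  # per-frame stylized expression coefficients
    head_pose: List[Frame]   # per-frame stylized head movement coefficients


def encode_style(reference: List[Frame]) -> Frame:
    """Stand-in style encoder: average the reference sequence over time."""
    n = len(reference)
    return [sum(column) / n for column in zip(*reference)]


def decode_with_style(features: List[Frame], style: Frame) -> List[Frame]:
    """Stand-in style-aware decoder: add the style code to every frame.

    Assumes features and style code share the same dimension.
    """
    return [[f + s for f, s in zip(frame, style)] for frame in features]


def generate_coefficients(audio_features: List[Frame],
                          expr_reference: List[Frame],
                          pose_reference: List[Frame]) -> Coefficients:
    """Two-branch decoding: one style code per branch, shared audio input."""
    expr_style = encode_style(expr_reference)
    pose_style = encode_style(pose_reference)
    return Coefficients(
        expression=decode_with_style(audio_features, expr_style),
        head_pose=decode_with_style(audio_features, pose_style),
    )
```

A renderer, omitted here, would then take the resulting `Coefficients` together with the single portrait image and produce the output frames. The two reference sequences may come from the same video or from different ones.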

Paper Structure (45 sections, 18 equations, 15 figures, 6 tables)

Figures (15)

  • Figure 1: Illustration of StyleTalk++. Our method can control both facial expression and head pose styles in the generated talking faces using a unified style-controllable framework. These styles can be reflected in two additional style reference videos, including the expression style and head pose style videos, which can be the same. The unified style-controllable framework is extended into two branches: (1) The stylized expression generation branch first extracts sequential 3DMM expression parameters from the expression style reference video $\boldsymbol{V}_{\delta}$ using the 3D face reconstruction module, and then feeds them into the expression style encoder $\boldsymbol{E}^{\delta}_s$ to obtain the expression style code $\boldsymbol{s}_{\delta}$. A phoneme encoder $\boldsymbol{E}_p$ encodes phoneme labels into phoneme features $\boldsymbol{p}'_{1:T}$. Then, the style-aware expression decoder $\boldsymbol{E}_d$ generates the stylized expression parameters $\hat{\boldsymbol{\delta}}_{1:T}$ with $\boldsymbol{s}_{\delta}$ and $\boldsymbol{p}'_{1:T}$. (2) Similarly, the stylized head pose generation branch first extracts sequential head poses from the head pose style reference video and obtains the head pose style code $\boldsymbol{s}_{h}$. An acoustic encoder $\boldsymbol{E}_a$ encodes acoustic features into latent features $\boldsymbol{a}'_{1:T}$. Then, we use a style-aware head pose decoder $\boldsymbol{E}_d$ to generate the stylized head movements $\hat{\boldsymbol{h}}_{1:T}$ from $\boldsymbol{s}_{h}$ and $\boldsymbol{a}'_{1:T}$. Finally, the image renderer $\boldsymbol{E}_r$ takes the assembled $\hat{\boldsymbol{\delta}}_{1:T}$ and $\hat{\boldsymbol{h}}_{1:T}$, and the identity reference image $\boldsymbol{I}^r$ as input, and generates the output video.
  • Figure 2: Illustration of the $i$-th step of the style-aware head pose decoder. The latent spatial embedding $\boldsymbol{e}_i$ and memory $\boldsymbol{c}_i$ are the intermediate features in this decoder and will be updated at each step.
  • Figure 3: Illustration of the style-aware adaptive transformer decoder layer.
  • Figure 4: Mouth embedding extraction in lip-sync discriminator.
  • Figure 5: Qualitative comparisons with person-agnostic methods. The identity reference, expression style reference videos, and audio-synced videos are displayed in the first two rows. This figure mainly compares visual quality, facial expression, and lip-sync accuracy. Note that EAMM, GC-AVT, and our method use the same video clip as the expression style reference. For PC-AVS, AVCT, EAMM, GC-AVT, and our method, head poses are derived from the Mouth GT video. Please zoom in or see our demo video for more details.
  • ...and 10 more figures