SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model

Weipeng Tan, Chuming Lin, Chengming Xu, Xiaozhong Ji, Junwei Zhu, Chengjie Wang, Yunsheng Wu, Yanwei Fu

TL;DR

This work tackles talking head generation (THG) by explicitly modeling intrinsic style, which captures speaking habits and facial expressions often neglected by existing diffusion-based methods. It introduces Probabilistic Style Prior Learning to represent intrinsic style as a Gaussian $s \sim \mathcal{N}(\mu_s, \sigma_s^2)$ learned from paired audio $\boldsymbol{\alpha}$ and 3DMM expressions $\boldsymbol{\beta}$, enabling stochastic variation through sampling. The Style-Driven Diffusion Process injects this style prior into a pretrained Stable Diffusion backbone via two modules: HEAD-Kps Guider and Style Projection, with a three-stage training regime that first learns the style extractor and then progressively finetunes the diffusion model. Experiments on MEAD and HDTF demonstrate state-of-the-art performance across FVD, FID, PSNR, SSIM, M-LMD, SyncNet, and StyleSim, and show robust intrinsic style transfer and interpolation for unseen faces, indicating strong practical potential for personalized, realistic digital humans.
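
The summary above mentions interpolation between intrinsic styles but does not say how it is carried out. As a minimal sketch, assuming each style prior is a diagonal Gaussian described by a mean vector and a standard-deviation vector, one plausible reading is a linear blend of the two sets of parameters. The function name `interpolate_style` and the toy values are hypothetical and are not taken from the paper.

```python
def interpolate_style(mu_a, sigma_a, mu_b, sigma_b, t):
    """Blend two diagonal-Gaussian style priors.

    t = 0 returns style A, t = 1 returns style B.
    Linear blending is an assumption, not the paper's stated procedure.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    mu = [(1.0 - t) * a + t * b for a, b in zip(mu_a, mu_b)]
    sigma = [(1.0 - t) * a + t * b for a, b in zip(sigma_a, sigma_b)]
    return mu, sigma


# Toy 2-d style priors (illustrative values only).
mu_a, sigma_a = [0.0, 2.0], [1.0, 1.0]
mu_b, sigma_b = [4.0, -2.0], [3.0, 0.5]

mid_mu, mid_sigma = interpolate_style(mu_a, sigma_a, mu_b, sigma_b, 0.5)
```

Sweeping `t` from 0 to 1 would trace a path between the two styles; a style drawn from each intermediate Gaussian could then be used as the conditioning signal.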

Abstract

Talking Head Generation (THG), typically driven by audio, is an important and challenging task with broad applications in fields such as digital humans, film production, and virtual reality. While diffusion-based THG methods produce high-quality and stable content, they often overlook the intrinsic style of a video, which encompasses personalized features such as speaking habits and facial expressions. As a consequence, the generated video lacks diversity and vividness, which limits its use in real-life scenarios. To address these issues, we propose a novel framework named Style-Enhanced Vivid Portrait (SVP), which fully leverages style-related information in THG. Specifically, we first introduce probabilistic style prior learning, which models the intrinsic style as a Gaussian distribution using facial expressions and audio embeddings. The distribution is learned through a bespoke contrastive objective, effectively capturing the dynamic style information in each video. We then finetune a pretrained Stable Diffusion (SD) model to inject the learned intrinsic style as a controlling signal via cross attention. Experiments show that our model generates diverse, vivid, and high-quality videos with flexible control over intrinsic styles, outperforming existing state-of-the-art methods.

Paper Structure

This paper contains 25 sections, 5 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: In talking head generation, given an audio clip and a reference image, both the GAN-based method SadTalker and the diffusion-based method V-Express generate monotonous portrait videos in which the primary movement is in the lips. In contrast, our approach is capable of generating diverse and vivid portrait videos based on varying intrinsic styles.
  • Figure 2: The framework of $\mathtt{SVP}$. The $\mathtt{SVP}$ framework comprises Probabilistic Style Prior Learning and the Style-Driven Diffusion Process. In Probabilistic Style Prior Learning, we utilize a dual-branch transformer to convert the audio features $\alpha$ and the expression parameters $\beta$ into the latent vectors $\hat{s}^{\alpha}$ and $\hat{s}^{\beta}$ respectively, then obtain the style-related embedding $\hat{s}$ via a cross-attention layer. Finally, we use the learnable parameter $W_s$ to map the embedding $\hat{s}$ to the mean $\mu_s$ and variance $\sigma_s^2$, and the style prior $s$ is sampled from $\mathcal{N}(\mu_s, \sigma_s^2)$. In the Style-Driven Diffusion Process, the Denoising UNet takes the reference image, Head-Kps sequences, audio features, and style prior as conditions to denoise the input noise at each time step.
  • Figure 3: (a) Visual comparison with recent SOTA methods. The first two rows show the comparison of reconstruction results, while the last two rows show the comparison of intrinsic style transfer results. (b) Our method uses intrinsic style to generate frames on different types of portraits. This demonstrates that our method can successfully apply intrinsic style to various types of data, even if only real-life videos are available for training.
  • Figure 4: Comparison of visualization results with and without Probabilistic Style Prior Learning.
  • Figure 5: Intrinsic style prior visualization. The color gets darker as the intensity of the emotion increases.
  • ...and 1 more figures
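
The last step described in the Figure 2 caption, mapping the style embedding $\hat{s}$ to a mean and variance and then sampling $s \sim \mathcal{N}(\mu_s, \sigma_s^2)$, can be sketched as below. This is a toy illustration under stated assumptions rather than the authors' implementation: the split of $W_s$ into two affine heads, the log-variance parameterization, and all function names and numeric values are hypothetical.

```python
import math
import random


def linear(weights, bias, x):
    """Apply the affine map y = W x + b to a vector stored as a list."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def style_distribution(s_hat, w_mu, b_mu, w_logvar, b_logvar):
    """Map a style embedding to the mean and std of a diagonal Gaussian.

    Predicting the log-variance (an assumption) keeps sigma strictly positive.
    """
    mu = linear(w_mu, b_mu, s_hat)
    log_var = linear(w_logvar, b_logvar, s_hat)
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    return mu, sigma


def sample_style(mu, sigma, rng):
    """Draw s = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


# Toy 3-d embedding mapped to a 2-d style prior (illustrative values only).
s_hat = [0.5, -1.0, 2.0]
w_mu = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_mu = [0.0, 0.0]
w_logvar = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b_logvar = [0.0, math.log(0.25)]

mu, sigma = style_distribution(s_hat, w_mu, b_mu, w_logvar, b_logvar)

rng = random.Random(0)  # fixed seed so the draws are reproducible
samples = [sample_style(mu, sigma, rng) for _ in range(3)]
```

Each call to `sample_style` yields a different style vector for the same audio and expression input, which is the source of the stochastic variation the captions describe.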