Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

Jintao Tan, Xize Cheng, Lingyu Xiong, Lei Zhu, Xiandong Li, Xianjia Wu, Kai Gong, Minglei Li, Yi Cai

TL;DR

The paper tackles audio-driven talking head generation by balancing lip-speech synchronization, visual fidelity, and temporal coherence. It introduces a two-stage landmark-guided diffusion framework: an Audio-driven landmark generation stage (A2L) produces landmark sequences from speech, which condition a Landmark-driven talking head generation stage (L2V) that denoises latent representations to synthesize high-quality, synchronized video. The approach yields state-of-the-art visual metrics on the HDTF dataset while maintaining competitive lip-sync, and ablation studies confirm the benefits of using landmarks as a robust intermediate representation. This landmark-guided diffusion pipeline enables more realistic and temporally stable talking head avatars, with potential impact on virtual avatars, film production, and online conferencing.

Abstract

Audio-driven talking head generation is a significant and challenging task with applications in fields such as virtual avatars, film production, and online conferences. However, existing GAN-based models emphasize generating well-synchronized lip shapes but overlook the visual quality of the generated frames, while diffusion-based models prioritize generating high-quality frames but neglect lip shape matching, resulting in jittery mouth movements. To address these problems, we introduce a two-stage diffusion-based model. The first stage generates synchronized facial landmarks from the given speech. In the second stage, the generated landmarks serve as a condition in the denoising process, with the aim of reducing mouth jitter and generating high-fidelity, well-synchronized, and temporally coherent talking head videos. Extensive experiments demonstrate that our model yields the best performance.
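
The two-stage data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' model: there are no neural networks here, and every function name, the 68-point landmark layout, the latent size, and the update rule are hypothetical placeholders chosen only to show how stage one's landmark sequence becomes the condition for stage two's iterative denoising.

```python
import random

# Assumption: a 68-point face layout; the paper's actual landmark count is not given here.
NUM_LANDMARKS = 68


def audio_to_landmarks(audio_feats, ref_landmarks):
    """Stage 1 stand-in (A2L): produce one landmark set per audio frame.

    Hypothetical rule: shift the reference landmarks vertically by an amount
    proportional to the mean absolute value of that frame's audio features.
    """
    sequence = []
    for frame in audio_feats:
        energy = sum(abs(v) for v in frame) / len(frame)
        sequence.append([(x, y + 0.01 * energy) for (x, y) in ref_landmarks])
    return sequence


def denoise_step(latent, landmarks, t, num_steps):
    """Toy conditioned update: pull the latent toward a landmark-derived target.

    A real denoiser would be a learned network; this only shows where the
    landmark condition enters each step.
    """
    target = sum(y for _, y in landmarks) / len(landmarks)
    weight = 1.0 / (num_steps - t)  # grows to 1.0 at the final step
    return [z + weight * (target - z) for z in latent]


def landmarks_to_video(landmark_seq, latent_dim=4, num_steps=10, seed=0):
    """Stage 2 stand-in (L2V): denoise one random latent per frame."""
    rng = random.Random(seed)
    frames = []
    for landmarks in landmark_seq:
        latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]  # start from noise
        for t in range(num_steps):
            latent = denoise_step(latent, landmarks, t, num_steps)
        frames.append(latent)
    return frames


# Usage: three audio frames and a flat reference face.
reference = [(i / NUM_LANDMARKS, 0.5) for i in range(NUM_LANDMARKS)]
audio = [[0.0, 0.0], [1.0, -1.0], [2.0, 2.0]]
landmark_sequence = audio_to_landmarks(audio, reference)
latent_frames = landmarks_to_video(landmark_sequence)
print(len(landmark_sequence), len(latent_frames))
```

The point of the sketch is the interface between the stages: the second stage never sees audio directly, only the landmark sequence, which is the role the abstract assigns to landmarks as a condition in the denoising process.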

Paper Structure (10 sections, 6 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Overview of our proposed model for talking head generation, which consists of two sub-modules. (a) The audio-driven landmark generation (A2L) module takes the given speech and the original facial image as input and generates the landmark movement sequence. (b) The generated landmarks serve as a condition for the denoising process in the landmark-driven talking head generation (L2V) module.
  • Figure 2: Visual comparisons between our proposed method and several state-of-the-art methods: Wav2Lip [b2], MakeItTalk [b8], TalkLip [b1], SadTalker [b21], and DiffTalk [b3].