DreamHead: Learning Spatial-Temporal Correspondence via Hierarchical Diffusion for Audio-driven Talking Head Synthesis

Fa-Ting Hong, Yunfei Liu, Yu Li, Changyin Zhou, Fei Yu, Dan Xu

TL;DR

DreamHead tackles the challenge of aligning temporal audio cues with spatial facial expressions in talking head synthesis by introducing a two-stage hierarchical diffusion framework. The first stage (A2L) learns to map audio segments to temporally smooth facial landmarks, while the second stage (L2I) uses those landmarks to condition a latent diffusion model that renders photorealistic frames with explicit spatial-consistency cues. By normalizing landmarks to a canonical pose and employing self-attention-based conditioning, the approach decouples pose/identity from expression and achieves robust cross-modal synchronization, even without ground-truth landmarks at inference. Experiments on HDTF and MEAD show state-of-the-art lip-sync accuracy, temporal stability, and image quality, supported by ablations that confirm the importance of temporal and spatial conditioning and the intermediate landmark representation.

Abstract

Audio-driven talking head synthesis strives to generate lifelike video portraits from provided audio. The diffusion model, recognized for its superior quality and robust generalization, has been explored for this task. However, establishing a robust correspondence between temporal audio cues and corresponding spatial facial expressions with diffusion models remains a significant challenge in talking head generation. To bridge this gap, we present DreamHead, a hierarchical diffusion framework that learns spatial-temporal correspondences in talking head synthesis without compromising the model's intrinsic quality and adaptability. DreamHead learns to predict dense facial landmarks from audio as intermediate signals to model the spatial and temporal correspondences. Specifically, a first hierarchy of audio-to-landmark diffusion is designed to predict temporally smooth and accurate landmark sequences from audio sequence signals. Then, a second hierarchy of landmark-to-image diffusion is proposed to produce spatially consistent facial portrait videos by modeling spatial correspondences between the dense facial landmarks and appearance. Extensive experiments show that the proposed DreamHead can effectively learn spatial-temporal consistency with the designed hierarchical diffusion and produce high-fidelity audio-driven talking head videos for multiple identities.
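
To make the idea of diffusing a landmark sequence concrete, the following is a minimal sketch of the standard DDPM forward (noising) step applied to a window of landmarks. It is not DreamHead's implementation: the summary does not state the noise schedule, the number of diffusion steps, or the landmark count, so the linear schedule, the 1000 steps, and the `(frames, points, 2)` shape are all assumptions chosen for illustration, and `linear_beta_schedule` and `q_sample` are hypothetical helper names.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear noise schedule (an assumption; the paper's schedule is not given here)."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(x0, t, alpha_bar, rng):
    """Standard DDPM forward step: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise


rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor

# Hypothetical window: 25 frames, 100 landmarks, 2D coordinates in [-1, 1].
landmarks = rng.uniform(-1.0, 1.0, size=(25, 100, 2))

# Noise the whole window to the same diffusion step.
x_t, noise = q_sample(landmarks, 500, alpha_bar, rng)
```

In a setup like the one the abstract describes, a denoising network conditioned on audio features would be trained to invert this noising process; that network is not sketched here because its details are not given in this summary.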

Paper Structure (13 sections, 5 equations, 8 figures, 5 tables)

This paper contains 13 sections, 5 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: We propose a hierarchical diffusion framework (DreamHead) that diffuses facial landmarks as intermediate signals representing facial expression and aims to learn spatial-temporal correspondences in audio-driven talking head synthesis. Given a driving audio sequence, DreamHead estimates jitter-free landmark sequences that are temporally aligned with the audio and synthesizes temporally smooth, lip-synced talking videos using explicit spatial consistency from the predicted landmarks. No ground-truth (GT) landmarks are required during inference.
  • Figure 2: Illustration of our framework. DreamHead diffuses landmarks as intermediate signals to learn the spatial-temporal correspondence for talking head video generation. (a) The first hierarchy, audio-to-landmark diffusion (A2L), takes an audio sequence as input and predicts a temporally corresponding landmark sequence with matching lip shapes. Temporal units in the A2L network help produce a jitter-free landmark sequence. (b) The second hierarchy, landmark-to-image diffusion, produces the final portrait video given a set of spatial-correspondence conditions. $\mathcal{E}$ is an image encoder that downsamples the input image, while $\mathcal{D}$ is a decoder that upsamples a latent to generate an image.
  • Figure 3: The illustration of our audio-to-landmark diffusion process. The audio-to-landmark network (A2L network) contains multiple fully connected layers to change the dimensions of inputs. Moreover, multiple temporal units in the A2L network can perceive the temporal information from audio cues.
  • Figure 4: Illustration of the inference process. Given a source video and a segment of input audio, DreamHead outputs a sequence of portrait images. "$\textcircled{\scriptsize{T}}$" denotes the transformation operation. We transform the predicted landmarks $P_0^{[i,i-l]}$ after de-normalization. In this work, we use 3DDFA-V2 [guo2020towards] as the head pose detector. $\sigma$ is the variance of the canonical landmarks (see the A2L section for details).
  • Figure 5: Visual comparison with other methods in the cross-identity setting. Our method produces more accurate results than the compared methods.
  • ...and 3 more figures
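
The Figure 4 caption mentions de-normalizing the predicted landmarks and then applying a transformation driven by a head pose detector. The sketch below shows one plausible reading of that step, and every detail in it is an assumption: the landmarks are taken to be 3D, the statistics are computed per coordinate, $\sigma$ is treated simply as a scale factor (the caption calls it the variance), and the pose is reduced to a yaw rotation plus a translation. The function names `normalize`, `denormalize`, `yaw_rotation`, and `apply_pose` are hypothetical.

```python
import numpy as np


def normalize(landmarks, mean, sigma):
    """Map landmarks into a normalized canonical space (assumed per-coordinate statistics)."""
    return (landmarks - mean) / sigma


def denormalize(normed, mean, sigma):
    """Inverse of normalize: recover canonical-space coordinates."""
    return normed * sigma + mean


def yaw_rotation(angle_rad):
    """Rotation about the vertical (y) axis; a stand-in for a detected head pose."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def apply_pose(points, rotation, translation):
    """Rigidly transform row-vector points of shape (..., 3)."""
    return points @ rotation.T + translation


rng = np.random.default_rng(0)
# Hypothetical canonical landmarks: 8 frames, 100 points, 3D.
canonical = rng.normal(size=(8, 100, 3))
mean = canonical.mean(axis=(0, 1))
sigma = canonical.std(axis=(0, 1))  # assumption: used only as a scale factor

normed = normalize(canonical, mean, sigma)    # space a model would predict in
restored = denormalize(normed, mean, sigma)   # back to canonical coordinates
posed = apply_pose(restored, yaw_rotation(np.deg2rad(15.0)), np.array([0.0, 0.0, 0.5]))
```

Predicting in a pose-normalized space and re-applying the pose afterwards is what lets expression be modeled separately from head motion, which matches the decoupling described in the TL;DR.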