SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization

Xulin Fan, Heting Gao, Ziyi Chen, Peng Chang, Mei Han, Mark Hasegawa-Johnson

TL;DR

SyncDiff addresses the weak lip synchronization of diffusion-based talking head synthesis by adding two conditioning inputs: a bottlenecked temporal pose prior and AVHuBERT audio-visual features. The method combines three visual priors (a masked frame, an identity frame, and a bottlenecked pose frame) with a latent diffusion model that uses cross-attention, aiming for strong lip synchronization without sacrificing image fidelity. On LRS2 and LRS3, SyncDiff improves the synchronization score over prior diffusion-based methods by 27.7% and 62.3% (relative), respectively, and is competitive with GAN-based approaches. Diffusion-based generation remains slower and may struggle with out-of-distribution data, but the approach is a step toward robust, high-quality, lip-synced talking head synthesis, with applications in video dubbing, virtual avatars, and online communication.
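To make the three-prior idea concrete, here is a minimal sketch of how the masked frame, identity frame, and bottlenecked pose frame could be assembled into one conditioning input. Everything specific in it is an assumption made for illustration: the frames are tiny grayscale grids rather than latents, the mask covers the lower half of the frame, the bottleneck is approximated by average pooling, and the function names (`mask_lower_half`, `bottleneck`, `build_visual_prior`) are hypothetical. It is not the authors' implementation.

```python
# Illustrative sketch only. Frames are small grayscale grids (lists of rows).
# The lower-half mask, the average-pooling bottleneck, and all names below
# are assumptions for this example, not details taken from the paper.

def mask_lower_half(frame):
    """Return a copy of `frame` with the lower half zeroed out."""
    h = len(frame)
    return [row[:] if y < h // 2 else [0.0] * len(row)
            for y, row in enumerate(frame)]

def bottleneck(frame, factor):
    """Average-pool by `factor`, then repeat values back to the input size.

    Height and width must be divisible by `factor`. Only coarse
    information survives the pooling step.
    """
    h, w = len(frame), len(frame[0])
    pooled = [
        [
            sum(frame[y + dy][x + dx]
                for dy in range(factor) for dx in range(factor)) / factor ** 2
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]
    return [[pooled[y // factor][x // factor] for x in range(w)]
            for y in range(h)]

def build_visual_prior(target, identity, pose_source, factor=2):
    """Stack the three priors as channels: masked, identity, bottlenecked pose."""
    return [
        mask_lower_half(target),
        [row[:] for row in identity],
        bottleneck(pose_source, factor),
    ]

# Toy 4x4 frames with made-up pixel values.
target = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0],
          [9.0, 10.0, 11.0, 12.0],
          [13.0, 14.0, 15.0, 16.0]]
identity = [[0.5] * 4 for _ in range(4)]
prior = build_visual_prior(target, identity, pose_source=target)
```

In this toy setup `prior` holds three 4x4 channels. A real system would presumably operate on encoded latents and learn the bottleneck rather than fix it as pooling.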

Abstract

Talking head synthesis, also known as speech-to-lip synthesis, reconstructs the facial motions that align with a given audio track. The synthesized videos are evaluated mainly on two aspects: lip-speech synchronization and image fidelity. Recent studies demonstrate that GAN-based and diffusion-based models achieve state-of-the-art (SOTA) performance on this task, with diffusion-based models achieving superior image fidelity but lower synchronization than their GAN-based counterparts. To this end, we propose SyncDiff, a simple yet effective approach that improves diffusion-based models by conditioning the diffusion process on a temporal pose frame with an information bottleneck and on facial-informative audio features extracted from AVHuBERT. We evaluate SyncDiff on two canonical talking head datasets, LRS2 and LRS3, for direct comparison with other SOTA models. Experiments on the LRS2/LRS3 datasets show that SyncDiff achieves a synchronization score that is 27.7%/62.3% higher, in relative terms, than previous diffusion-based methods, while preserving their high-fidelity characteristics.
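The 27.7%/62.3% figures are relative improvements, that is, the gain divided by the baseline score rather than an absolute difference. A small sketch of that arithmetic follows; the helper name `relative_gain` and the example scores are hypothetical and chosen only so the result lands on 27.7%, since the underlying scores are not given in this section.

```python
def relative_gain(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100.0

# Hypothetical scores, not taken from the paper.
example_gain = relative_gain(6.385, 5.0)
print(f"{example_gain:.1f}%")
```

Under this definition, a relative gain of 62.3% means the new score is 1.623 times the baseline score.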

Paper Structure

This paper contains 25 sections, 3 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Architecture of the SyncDiff network. Solid lines denote inputs during training, while dashed lines denote inputs during inference. $I_t$ denotes the ground-truth frame at timestep $t$, $\hat{I}_t$ denotes the synthesized frame at timestep $t$, and $I_{rand}$ denotes a randomly sampled frame from the ground-truth sequence. BN denotes the bottleneck layer, which compresses the dimension of the pose prior.
  • Figure 2: Visual comparison with SOTA talking head generation methods. The letter that each frame corresponds to is marked in red.
  • Figure 3: Visual comparison of using different reference frames.