GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

Yihong Lin; Zhaoxin Fan; Xianjia Wu; Lingyu Xiong; Liang Peng; Xiandong Li; Wenxiong Kang; Songju Lei; Huang Xu

TL;DR

The paper tackles speech-driven 3D facial animation, where the audio-to-motion mapping is inherently many-to-many and prone to misalignment. The authors introduce GLDiTalker, a Graph Latent Diffusion Transformer that operates in a quantized spatiotemporal latent space to improve lip-sync while enabling motion diversity. The method uses a two-stage pipeline: Stage 1, Graph Enhanced Quantized Space Learning, encodes lip-sync-friendly latents, and Stage 2, Space-Time Powered Latent Diffusion, injects diversity conditioned on audio and speaker style. Experiments on BIWI and VOCASET show state-of-the-art lip-sync accuracy and motion diversity.

Abstract

Speech-driven talking head generation is a critical yet challenging task with applications in augmented reality and virtual human modeling. While recent approaches using autoregressive and diffusion-based models have achieved notable progress, they often suffer from modality inconsistencies, particularly misalignment between audio and mesh, leading to reduced motion diversity and lip-sync accuracy. To address this, we propose GLDiTalker, a novel speech-driven 3D facial animation model based on a Graph Latent Diffusion Transformer. GLDiTalker resolves modality misalignment by diffusing signals within a quantized spatiotemporal latent space. It employs a two-stage training pipeline: the Graph-Enhanced Quantized Space Learning Stage ensures lip-sync accuracy, while the Space-Time Powered Latent Diffusion Stage enhances motion diversity. Together, these stages enable GLDiTalker to generate realistic, temporally stable 3D facial animations. Extensive evaluations on standard benchmarks demonstrate that GLDiTalker outperforms existing methods, achieving superior results in both lip-sync accuracy and motion diversity.

Paper Structure (21 sections, 15 equations, 5 figures, 5 tables)

This paper contains 21 sections, 15 equations, 5 figures, 5 tables.

Introduction
Related Work
Speech-Driven 3D Facial Animation
Diffusion in Talking Head Animation
Method
Overview
Problem Definition
Quantized Space-Time Diffusion Training Pipeline
Stage 1: Graph Enhanced Quantized Space Learning for Lip-sync Accuracy Improvement
Constraints
Stage 2: Space-Time Powered Latent Diffusion for Motion Diversity Enhancement
Constraints
Experiments
Datasets and Implementations
Quantitative Evaluation
...and 6 more sections

Figures (5)

Figure 1: The left plots show the lip vertex error of GLDiTalker and FaceDiffuser; the right plots show the diversity of GLDiTalker and other methods, including FaceFormer, CodeTalker, TalkingStyle and FaceDiffuser. "Vertex Error" refers to the average $L_2$ loss between prediction and ground truth. "Diversity" refers to the statistical mean of the motion differences between corresponding frames across multiple generation results for the same sequence. GLDiTalker shows a clear advantage in both generation quality and motion diversity.
Figure 2: Overview of the proposed GLDiTalker. To enhance lip-sync accuracy, Stage 1 employs the Spatial Pyramidal SpiralConv Encoder and a temporal transformer encoder to sequentially process the input motions and obtain discrete latent facial motions. To enhance motion diversity, Stage 2 then uses diffusion-based iterative denoising to synthesize the latent facial motions from input speech and speaker identity; these are later decoded into a motion mesh sequence by the pretrained, frozen decoder of the first stage. "$\oplus$" denotes addition and "N" denotes $N$ identical blocks, each consisting of a SpiralConv layer and a Downsample layer. The heatmaps show the statistical mean of the motion differences between corresponding frames across multiple generation results for the same sequence.
Figure 3: Architecture of Spatial Pyramidal SpiralConv Encoder. The spiral lines represent spiral convolution, the orange arrows represent downsampling, the green arrow represents flatten, and the blue arrow represents fusion by MLP.
Figure 4: Qualitative comparison on VOCASET-Test (left) and BIWI-Test-B (right).
Figure 5: Standard deviation heatmaps of facial motion within a randomly sampled sequence, where dark blue indicates less motion variation and bright red indicates more motion variation.
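
The two metrics named in the Figure 1 caption can be made concrete with a small sketch. This is an illustration only, not the authors' evaluation code. It assumes a mesh sequence is a nested list indexed as [frame][vertex] of (x, y, z) tuples, reads "average $L_2$ loss" as the mean Euclidean distance per vertex, and reads "Diversity" as that same distance averaged over all pairs of generated samples. The function names mean_vertex_error and diversity are hypothetical, and the paper's exact definitions (for example, restricting the error to lip vertices) may differ.

```python
import math
from itertools import combinations


def mean_vertex_error(pred, target):
    """Mean Euclidean distance between corresponding vertices.

    Both arguments are sequences indexed as [frame][vertex] -> (x, y, z).
    Assumption: "average L2 loss" means unsquared Euclidean distance.
    """
    distances = [
        math.dist(p, t)
        for pred_frame, target_frame in zip(pred, target)
        for p, t in zip(pred_frame, target_frame)
    ]
    return sum(distances) / len(distances)


def diversity(samples):
    """Mean pairwise difference between generations for one input.

    `samples` is indexed as [sample][frame][vertex] -> (x, y, z).
    Assumption: "motion differences between corresponding frames" is
    measured with the same per-vertex distance, averaged over all
    unordered pairs of samples.
    """
    pairs = list(combinations(samples, 2))
    return sum(mean_vertex_error(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy data: one frame, two vertices.
    ground_truth = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
    sample_a = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
    sample_b = [[(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]]

    print(mean_vertex_error(sample_b, ground_truth))  # 2.5
    print(diversity([sample_a, sample_b]))            # 2.5
```

Under this reading, a deterministic model that returns the same mesh sequence for every run scores a diversity of zero, which is why the two metrics are reported side by side: low error alone does not show that a model captures the many-to-many nature of the mapping.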
