GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting

Hongyun Yu, Zhan Qu, Qihang Yu, Jianchuan Chen, Zhonghua Jiang, Zhiwen Chen, Shengyu Zhang, Jimin Xu, Fei Wu, Chengfei Lv, Gang Yu

TL;DR

GaussianTalker targets high-fidelity, speaker-specific talking head synthesis by combining 3D Gaussian Splatting with a FLAME-based facial model. The framework decouples identity from content using a Universal Audio Encoder and a Personalised Motion Decoder to generate FLAME parameters, while a Dynamic Gaussian Renderer binds Gaussians to the FLAME topology and refines details via Speaker-specific BlendShapes and an Inpainting Generator. It is reported to achieve precise lip synchronization and high image quality, with real-time rendering at 130 FPS on an NVIDIA RTX 4090 GPU and robust cross-language, cross-identity performance. In the reported experiments, the approach outperforms state-of-the-art NeRF-based and mesh-based methods, and it is designed to generalize across speakers and to be deployable on other hardware platforms for real-time applications.

Abstract

Recent works on audio-driven talking head synthesis using Neural Radiance Fields (NeRF) have achieved impressive results. However, due to the inadequate pose and expression control caused by NeRF's implicit representation, these methods still have limitations, such as unsynchronized or unnatural lip movements, visual jitter, and artifacts. In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting. Because 3D Gaussians are an explicit representation, facial motion can be controlled intuitively by binding Gaussians to 3D facial models. GaussianTalker consists of two modules: a Speaker-specific Motion Translator and a Dynamic Gaussian Renderer. The Speaker-specific Motion Translator achieves accurate lip movements specific to the target speaker through universalized audio feature extraction and customized lip motion generation. The Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose, delivering stable and realistic rendered videos. Extensive experimental results suggest that GaussianTalker outperforms existing state-of-the-art methods in talking head synthesis, delivering precise lip synchronization and exceptional visual quality. Our method achieves rendering speeds of 130 FPS on an NVIDIA RTX 4090 GPU, significantly exceeding the threshold for real-time rendering performance, and can potentially be deployed on other hardware platforms.
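
The abstract's idea of binding Gaussians to a 3D facial model can be illustrated with a small sketch. It assumes one common binding scheme: each Gaussian stores a triangle index, barycentric coordinates, and an offset along the triangle normal, so its center follows the mesh when the vertices move. This scheme, and the names `bound_gaussian_center`, `bary`, and `normal_offset`, are illustrative assumptions and not the paper's actual formulation, which this section does not describe.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _unit(v: Vec3) -> Vec3:
    length = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    return (v[0] / length, v[1] / length, v[2] / length)


def bound_gaussian_center(
    vertices: Sequence[Vec3],
    face: Tuple[int, int, int],
    bary: Vec3,
    normal_offset: float,
) -> Vec3:
    """Return the center of a Gaussian bound to one mesh triangle.

    Hypothetical binding: a barycentric point on the triangle plus a
    displacement along the triangle's unit normal.
    """
    v0, v1, v2 = (vertices[i] for i in face)
    # Point on the triangle surface given by the barycentric weights.
    on_surface = tuple(
        bary[0] * v0[k] + bary[1] * v1[k] + bary[2] * v2[k] for k in range(3)
    )
    # Unit normal of the (possibly deformed) triangle.
    normal = _unit(_cross(_sub(v1, v0), _sub(v2, v0)))
    return tuple(on_surface[k] + normal_offset * normal[k] for k in range(3))


# Toy mesh: one triangle at rest, then the same triangle moved 2 units along z.
rest_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
moved_vertices = [(x, y, z + 2.0) for (x, y, z) in rest_vertices]

binding = dict(face=(0, 1, 2), bary=(1 / 3, 1 / 3, 1 / 3), normal_offset=0.1)
rest_center = bound_gaussian_center(rest_vertices, **binding)
moved_center = bound_gaussian_center(moved_vertices, **binding)
```

Because the binding is stored relative to the triangle, only the mesh vertices need to be updated per frame; the Gaussian centers are recomputed from them, which is what makes an explicit representation straightforward to drive with a parametric face model.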

Paper Structure

This paper contains 31 sections, 21 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Overview of the proposed GaussianTalker. Subfigure (a) depicts the speaker-specific FLAME model generated from audio, which drives the Gaussians for rendering. Subfigure (b) illustrates the fusion of speaker-agnostic features with speaker ID embeddings to decode FLAME parameters. Subfigure (c) shows the Gaussians driven by FLAME and the subsequent rendering of frames.
  • Figure 2: Negative audio samples $A_{neg,i}$ are obtained by dividing the audio, and the positive audio sample $A_{pos}$ is obtained by timbre conversion. The audio samples are encoded into their corresponding features, which are used in adversarial learning to fine-tune the encoder into a Universal Audio Encoder.
  • Figure 3: Comparison of generated key frames. Ground-truth frames are shown for comparison, and results with out-of-sync lips or poor rendering quality are marked with red arrows. Please zoom in for better visualization.
  • Figure 4: Ablation study on audio decoupling and Speaker-specific BlendShapes. Removing them leads to the results shown in (a) and (b).
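
The fusion step described for Figure 1(b) can be sketched in a few lines. This is a minimal illustration that assumes fusion by concatenation followed by a single linear layer; both choices, the function name `fuse_and_decode`, and the toy dimensions are assumptions, since the caption does not say how the features are fused or decoded.

```python
from typing import List, Sequence


def fuse_and_decode(
    audio_feature: Sequence[float],
    speaker_embedding: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Concatenate a speaker-agnostic audio feature with a speaker ID
    embedding, then apply one linear layer to get motion parameters.

    Hypothetical stand-in for a motion decoder; the outputs play the
    role of FLAME expression/pose coefficients in this toy example.
    """
    fused = list(audio_feature) + list(speaker_embedding)
    return [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]


# Toy setup: 2-dim audio feature, 1-dim speaker embedding, 2 output parameters.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 2.0]]
bias = [0.0, 0.5]
audio_feature = [1.0, 2.0]

params_speaker_a = fuse_and_decode(audio_feature, [0.5], weights, bias)
params_speaker_b = fuse_and_decode(audio_feature, [1.5], weights, bias)
```

In this toy setup the same audio feature yields different parameters for the two speaker embeddings, which is the behaviour the caption attributes to speaker-specific decoding: the content comes from the audio, and the speaking style comes from the identity embedding.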