
PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based Talking Head Synthesis

Yifan Xie, Tao Feng, Xin Zhang, Xiangyang Luo, Zixuan Guo, Weijiang Yu, Heng Chang, Fei Ma, Fei Richard Yu

TL;DR

PointTalk tackles the challenge of high-fidelity, audio-driven talking head synthesis with limited data by coupling a static 3D Gaussian head to an audio-driven dynamic lip point cloud. The method introduces Audio2Point to generate lip points from speech, a dynamic difference encoder to capture lip motion nuances, and an audio-point enhancement module for cross-modal synchronization, all feeding an adaptive 3D Gaussian Splatting renderer. Key innovations include a tri-plane hash-encoded geometry, cross-modal contrastive learning between audio and lip points, and AdaIN-style feature fusion to produce deformation parameters for real-time rendering. Experiments show improved visual quality and lip synchronization over state-of-the-art methods, with strong generalization to multilingual data and efficient inference, enabling realistic digital humans in practical settings.
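The cross-modal contrastive learning mentioned above can be illustrated with a small sketch. The summary does not state the exact loss, so the symmetric InfoNCE-style form, the `temperature` value, and all function names here are assumptions made for illustration, not the authors' implementation. Each row of `audio_feats` is assumed to be paired with the same row of `point_feats`.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine_logits(audio_feats, point_feats, temperature=0.07):
    # Pairwise cosine similarity between every audio and lip-point feature,
    # divided by a temperature (hypothetical value, not from the paper).
    a = [l2_normalize(v) for v in audio_feats]
    p = [l2_normalize(v) for v in point_feats]
    return [[sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
            for ai in a]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class of row i is column i,
    # computed with a numerically stable log-sum-exp.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(audio_feats, point_feats, temperature=0.07):
    # Average the audio-to-point and point-to-audio directions.
    logits = cosine_logits(audio_feats, point_feats, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))
```

Under these assumptions the loss is small when each audio feature is closest to its own lip-point feature and large when the pairing is shuffled, which is the behaviour a synchronization objective in feature space would reward.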

Abstract

Talking head synthesis with arbitrary speech audio is a crucial challenge in the field of digital humans. Recently, methods based on radiance fields have received increasing attention due to their ability to synthesize high-fidelity and identity-consistent talking heads from just a few minutes of training video. However, due to the limited scale of the training data, these methods often exhibit poor performance in audio-lip synchronization and visual quality. In this paper, we propose a novel 3D Gaussian-based method called PointTalk, which constructs a static 3D Gaussian field of the head and deforms it in sync with the audio. It also incorporates an audio-driven dynamic lip point cloud as a critical component of the conditional information, thereby facilitating the effective synthesis of talking heads. Specifically, we first generate the corresponding lip point cloud from the audio signal and capture its topological structure. A dynamic difference encoder is then designed to capture the subtle nuances of dynamic lip movements more effectively. Furthermore, we integrate an audio-point enhancement module, which not only synchronizes the audio signal with the corresponding lip point cloud in the feature space, but also facilitates a deeper understanding of the interrelations among cross-modal conditional features. Extensive experiments demonstrate that our method achieves higher fidelity and better audio-lip synchronization in talking head synthesis than previous methods.

Paper Structure

This paper contains 15 sections, 10 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview of PointTalk. A static Gaussian field is first used to optimize a coarse Gaussian head from a random point cloud. The tri-plane encoder and the audio encoder then independently extract the spatial geometry feature $\mathrm{F}_c$ and the audio feature $\mathrm{F}_a$. The Audio2Point module generates a dynamic lip point cloud from the input audio signals. Subsequently, the topological structure of the dynamic lip point cloud is established, and the dynamic difference encoder extracts the point cloud feature $\mathrm{F}_p$. The audio-point enhancement module then synchronizes the audio signals with the point cloud to facilitate information interaction, yielding the enhancement features $\mathrm{\hat{F}}_{a}$ and $\mathrm{\hat{F}}_{p}$. Finally, the enhancement features are fed into two MLP decoders to compute the scale and shift factors. By integrating these factors with $\mathrm{F}_c$, an adaptive MLP predicts the deformation parameters for the 3DGS rasterizer.
  • Figure 2: The pipeline of the Audio2Point module.
  • Figure 3: The detailed structure of the Audio-Point Enhancement module.
  • Figure 4: Qualitative comparison of talking head synthesis by different methods. PointTalk has the best visual effect on lip movements and facial details. Please zoom in for better visualization.
  • Figure 5: User study. The rating scale ranges from 1 to 5, with higher numbers indicating better performance.
  • ...and 1 more figure
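
The last step in the Figure 1 caption, where scale and shift factors computed from $\mathrm{\hat{F}}_{a}$ and $\mathrm{\hat{F}}_{p}$ are combined with $\mathrm{F}_c$, might look roughly like the sketch below. It is a minimal illustration under several assumptions: the two MLP decoders are reduced to single linear layers, the enhancement features are simply concatenated, and $\mathrm{F}_c$ is normalized per feature vector in the AdaIN manner. None of these details, nor the function names, are confirmed by the caption.

```python
import math

def linear(x, weights, bias):
    # One dense layer: y = W x + b (stand-in for an MLP decoder).
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def normalize(feature, eps=1e-5):
    # Zero-mean, unit-variance normalization of a single feature vector
    # (assumed here; the caption only mentions scale and shift factors).
    mean = sum(feature) / len(feature)
    var = sum((x - mean) ** 2 for x in feature) / len(feature)
    return [(x - mean) / math.sqrt(var + eps) for x in feature]

def modulate(f_c, f_a_hat, f_p_hat, scale_params, shift_params):
    # Concatenate the enhanced audio and point features as the condition,
    # predict per-channel scale and shift, and apply them to normalized F_c.
    cond = list(f_a_hat) + list(f_p_hat)
    scale = linear(cond, *scale_params)
    shift = linear(cond, *shift_params)
    return [s * x + t for s, x, t in zip(scale, normalize(f_c), shift)]
```

In this reading, the modulated feature would then be passed to the adaptive MLP that predicts the deformation parameters; that network is omitted because the caption gives no detail about its structure.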