TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting

Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, Lin Gu

TL;DR

TalkingGaussian addresses distortions in NeRF-based 3D talking head synthesis by adopting a deformation-based approach on a persistent head structure built with 3D Gaussian Splatting. It introduces Deformable Gaussian Fields with a Face-Mouth Decomposition to separately learn facial and inside-mouth motions, enabling precise lip synchronization and robust head fidelity. The method demonstrates superior quantitative and qualitative performance and improved efficiency, highlighting the benefits of explicit deformation over appearance-modification in dynamic regions. This work offers a practical pathway for high-quality audio-driven 3D talking heads with strong generalization and potential applications in animation and virtual communication.

Abstract

Radiance fields have demonstrated impressive performance in synthesizing lifelike 3D talking heads. However, because steep appearance changes are difficult to fit, the prevailing paradigm, which represents facial motions by directly modifying point appearance, may lead to distortions in dynamic regions. To tackle this challenge, we introduce TalkingGaussian, a deformation-based radiance fields framework for high-fidelity talking head synthesis. Leveraging point-based Gaussian Splatting, our method represents facial motions by applying smooth and continuous deformations to persistent Gaussian primitives, without needing to learn the difficult appearance changes that previous methods must fit. Thanks to this simplification, precise facial motions can be synthesized while keeping facial features highly intact. Under this deformation paradigm, we further identify a face-mouth motion inconsistency that affects the learning of detailed speaking motions. To address this conflict, we decompose the model into two separate branches for the face and inside-mouth areas, which simplifies the learning tasks and helps reconstruct more accurate motion and structure in the mouth region. Extensive experiments demonstrate that our method renders high-quality, lip-synchronized talking head videos with better facial fidelity and higher efficiency than previous methods.
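
The deformation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Gaussian` class, `toy_motion_field`, and the single-number audio feature are all hypothetical stand-ins, and only position offsets are modeled here. The point it shows is that the canonical primitives stay persistent, with color and opacity never rewritten, while motion is expressed purely as an offset added at render time.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Gaussian:
    """Hypothetical persistent primitive; canonical attributes are immutable."""
    position: Vec3
    color: Vec3
    opacity: float


# Assumed signature: (canonical position, audio feature) -> position offset.
MotionField = Callable[[Vec3, Sequence[float]], Vec3]


def toy_motion_field(position: Vec3, audio: Sequence[float]) -> Vec3:
    """Stand-in for a learned motion field (an assumption for illustration).

    Points below y = 0, a made-up "lower face", move down in proportion
    to the first audio value; all other points stay still.
    """
    _, y, _ = position
    dy = -0.1 * audio[0] if y < 0.0 else 0.0
    return (0.0, dy, 0.0)


def deform(gaussians: List[Gaussian], field: MotionField,
           audio: Sequence[float]) -> List[Gaussian]:
    """Return deformed copies; appearance attributes are carried over unchanged."""
    deformed = []
    for g in gaussians:
        dx, dy, dz = field(g.position, audio)
        x, y, z = g.position
        deformed.append(replace(g, position=(x + dx, y + dy, z + dz)))
    return deformed


canonical = [
    Gaussian(position=(0.0, 0.5, 0.0), color=(0.8, 0.6, 0.5), opacity=1.0),
    Gaussian(position=(0.0, -0.5, 0.0), color=(0.7, 0.3, 0.3), opacity=1.0),
]
frame = deform(canonical, toy_motion_field, audio=[1.0])
```

In this sketch, changing the audio value only moves the lower point; its color is identical in every frame, which is the contrast with an appearance-modification scheme that would have to predict a new color per frame.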

Paper Structure

This paper contains 23 sections, 12 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Inaccurate predictions of the rapidly changing appearance often produce distorted facial features in previous NeRF-based methods. By keeping a persistent head structure and predicting deformation to represent facial motion, our TalkingGaussian outperforms previous methods in synthesizing more precise and clear talking heads.
  • Figure 2: Overview of TalkingGaussian. Learning from the speech video with training frames $I$, TalkingGaussian builds two separate branches to represent the dynamic face and inside mouth areas. Queried by the primitives in Persistent Gaussian Fields with parameters $\theta_C$, a point-wise deformation can be predicted from Grid-based Motion Fields conditioned with audio feature $\boldsymbol{a}$ and upper-face expression $\boldsymbol{e}$. After that, the 3DGS rasterizer renders the deformed 3D Gaussian primitives into 2D images observed from the given camera, which are then fused to synthesize the entire talking head.
  • Figure 3: (a) Facial motion reconstructed with deformation and with appearance modification. (b) Visualized traces of the changing coordinate offset (deformation) and RGB color (appearance modification) for two points with the same initial position. Over the sequence, the offset changes smoothly and the corresponding results are clear and accurate. In contrast, the color can change suddenly with a large step, which is difficult to fit and causes a distorted mouth (red box).
  • Figure 4: (a) Lips and the inside of the mouth, especially the teeth, are hard to separate correctly with a single motion field. (b) This further hampers the learning of the mouth structure and speaking motions, resulting in poor quality. Our Face-Mouth Decomposition addresses this problem and renders high-fidelity results.
  • Figure 5: Qualitative comparison of visual-audio synchronization. Our method performs best in synthesizing accurately synchronized talking head compared with all baselines [prajwal2020wav2lip, zhong2023identity, zhang2023dinet, guo2021ad, shen2022dfrf, tang2022rad, ye2023geneface, li2023efficient]. Please zoom in for better visualization.
  • ...and 4 more figures
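
The Figure 2 caption says the two rendered branches are "fused to synthesize the entire talking head" but does not give the fusion rule in this excerpt. One plausible reading, shown here purely as an assumption, is standard alpha compositing of the face render over the inside-mouth render. The function name `fuse` and the nested-list image layout are hypothetical choices for illustration.

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]
Image = List[List[RGB]]          # rows of pixels, values in [0, 1]
AlphaMap = List[List[float]]     # per-pixel opacity of the face branch


def fuse(face_rgb: Image, face_alpha: AlphaMap, mouth_rgb: Image) -> Image:
    """Composite the face render over the inside-mouth render.

    Assumed rule (not confirmed by the excerpt):
        out = alpha * face + (1 - alpha) * mouth
    so the mouth branch only shows where the face branch is transparent,
    for example between open lips.
    """
    fused: Image = []
    for face_row, alpha_row, mouth_row in zip(face_rgb, face_alpha, mouth_rgb):
        row = []
        for f, a, m in zip(face_row, alpha_row, mouth_row):
            row.append(tuple(a * fc + (1.0 - a) * mc for fc, mc in zip(f, m)))
        fused.append(row)
    return fused


# A 1x3 toy image: opaque face, fully open mouth, and a half-covered pixel.
face = [[(1.0, 1.0, 1.0), (1.0, 1.0, 1.0), (1.0, 1.0, 1.0)]]
alpha = [[1.0, 0.0, 0.5]]
mouth = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
result = fuse(face, alpha, mouth)
```

Keeping the branches separate until this final step is what lets each motion field handle only one kind of motion, which matches the motivation given in the Figure 4 caption.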