Toward Fine-Grained Facial Control in 3D Talking Head Generation

Shaoyang Xie, Xiaofeng Cong, Baosheng Yu, Zhipeng Gui, Jie Gui, Yuan Yan Tang, James Tin-Yau Kwok

TL;DR

This paper tackles the challenge of fine-grained control in audio-driven 3D talking head generation, particularly lip-sync accuracy and temporal stability. It introduces FG-3DGS, a framework that performs frequency-aware disentanglement by modeling low-frequency facial regions with a shared MLP and high-frequency regions (eyes, mouth) with region-specific networks, using Gaussian deltas $\Delta G_r$ to drive motion. A high-frequency-refined post-rendering alignment, guided by a lip-sync discriminator, further enhances synchronization and per-frame accuracy. Experiments show that FG-3DGS achieves state-of-the-art fidelity and lip synchronization across reconstruction and cross-subject tests, demonstrating the effectiveness of region-specific motion modeling and post-rendering alignment for realistic, real-time talking heads.

Abstract

Audio-driven talking head generation is a core component of digital avatars, and 3D Gaussian Splatting has shown strong performance in real-time rendering of high-fidelity talking heads. However, achieving precise control over fine-grained facial movements remains a significant challenge, particularly due to lip-synchronization inaccuracies and facial jitter, both of which can contribute to the uncanny valley effect. To address these challenges, we propose Fine-Grained 3D Gaussian Splatting (FG-3DGS), a novel framework that enables temporally consistent and high-fidelity talking head generation. Our method introduces a frequency-aware disentanglement strategy to explicitly model facial regions based on their motion characteristics. Low-frequency regions, such as the cheeks, nose, and forehead, are jointly modeled using a standard MLP, while high-frequency regions, including the eyes and mouth, are captured separately using a dedicated network guided by facial area masks. The predicted motion dynamics, represented as Gaussian deltas, are applied to the static Gaussians to generate the final head frames, which are rendered via a rasterizer using frame-specific camera parameters. Additionally, a high-frequency-refined post-rendering alignment mechanism, learned from large-scale audio-video pairs by a pretrained model, is incorporated to enhance per-frame generation and achieve more accurate lip synchronization. Extensive experiments on widely used datasets for talking head generation demonstrate that our method outperforms recent state-of-the-art approaches in producing high-fidelity, lip-synced talking head videos.
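The region-wise motion modeling described above can be illustrated with a minimal sketch. It assumes each static Gaussian carries a region label taken from a facial area mask and that the per-region delta is simply added to the static position; the `Gaussian` class, the `deform` helper, the region names, and the toy predictor functions are all hypothetical stand-ins for the learned networks, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
# A predictor maps (static position, conditioning features) to a position delta.
DeltaPredictor = Callable[[Vec3, Sequence[float]], Vec3]


@dataclass(frozen=True)
class Gaussian:
    position: Vec3
    region: str  # hypothetical label derived from a facial area mask


def deform(static: List[Gaussian],
           predictors: Dict[str, DeltaPredictor],
           features: Sequence[float]) -> List[Gaussian]:
    """Add a region-specific delta to every static Gaussian position."""
    deformed = []
    for g in static:
        # Pick the predictor that owns this Gaussian's region.
        delta = predictors[g.region](g.position, features)
        moved = tuple(p + d for p, d in zip(g.position, delta))
        deformed.append(Gaussian(moved, g.region))
    return deformed


# Toy stand-ins for the learned networks (assumptions, not the paper's models).
def face_predictor(pos: Vec3, feats: Sequence[float]) -> Vec3:
    return (0.0, 0.01 * feats[0], 0.0)   # small, slowly varying motion


def mouth_predictor(pos: Vec3, feats: Sequence[float]) -> Vec3:
    return (0.0, -0.5 * feats[0], 0.0)   # large, audio-driven motion


def eyes_predictor(pos: Vec3, feats: Sequence[float]) -> Vec3:
    return (0.0, 0.2 * feats[1], 0.0)    # blink-like motion


PREDICTORS = {
    "face": face_predictor,
    "mouth": mouth_predictor,
    "eyes": eyes_predictor,
}

static_head = [
    Gaussian((0.0, 1.0, 0.0), "face"),
    Gaussian((0.0, 0.0, 0.0), "mouth"),
    Gaussian((0.5, 2.0, 0.0), "eyes"),
]

# One frame: features[0] plays the role of audio, features[1] of expression.
frame = deform(static_head, PREDICTORS, features=[1.0, 0.5])
```

Routing each Gaussian to exactly one predictor keeps fast mouth and eye motion from being averaged with the slower motion of the cheeks, nose, and forehead, which is the intuition behind the disentanglement. A real system would also update rotation, scale, and opacity, and would pass the deformed Gaussians to a rasterizer with per-frame camera parameters.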

Paper Structure

This paper contains 17 sections, 16 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Illustration of lip-synchronization errors and facial jitter. The top row shows generated frames, while the bottom row shows the corresponding ground-truth frames. Arrows highlight noticeable discrepancies, demonstrating that existing methods suffer from lip-synchronization inaccuracies and unstable facial motions.
  • Figure 2: The main proposed FG-3DGS framework for 3D talking head generation. Given a specific portrait speech video, FG-3DGS first decomposes the head into three regions: the face, eyes, and mouth. After performing static 3D Gaussian reconstruction, a conditional deformation attention mechanism predicts Gaussian offsets based on the encoded audio feature $f_a$ and expression feature $f_e$. The outputs from the static and dynamic components are then combined, and a 3D Gaussian rasterizer renders the dynamic Gaussians into images under varying camera parameters. To enhance lip synchronization, the high-frequency refined post-rendering alignment is applied in the final stage.
  • Figure 3: Qualitative comparison of talking head synthesis across different methods. From top to bottom, rows show the ground truth and results produced by GeneFace, ER-NeRF, TalkingGaussian, and the proposed method. Close-up views highlight lip movements and fine facial details. The proposed method produces more accurate lip synchronization and more stable facial details, closely matching the ground truth. Zooming in is recommended for better visualization.
  • Figure 4: User study results. Mean scores from 20 participants on a 5-point scale, where higher values indicate better performance. The evaluation covers image quality, video realism, and lip synchronization. Methods A–F correspond to TalkLip, DINet, ER-NeRF, GeneFace, TalkingGaussian, and the proposed method (FG-3DGS), respectively.
  • Figure 5: Qualitative results of the ablation study on the proposed components. Each column shows the output obtained by removing one module (w/o FAD, w/o FAM, w/o HRPA) compared with the ground truth and the full model (Ours). Red bounding boxes highlight regions with visible artifacts and degradation, such as over-smoothing and inaccurate lip-synchronization.
  • ...and 1 more figure
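The Figure 2 caption mentions a conditional deformation attention mechanism driven by the audio feature $f_a$ and the expression feature $f_e$, without giving its exact form. As a rough illustration only, the sketch below shows one generic way attention could weight two conditioning vectors: a per-Gaussian query scores each condition with a scaled dot product, and a softmax turns the scores into mixing weights. The `fuse_conditions` function, the query vector, and the feature values are assumptions for illustration, not the paper's design.

```python
import math
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: Sequence[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_conditions(query: Sequence[float],
                    f_a: Sequence[float],
                    f_e: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Mix audio and expression features with attention weights.

    Returns the fused feature and the [audio, expression] weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f_a) / scale, dot(query, f_e) / scale])
    fused = [weights[0] * a + weights[1] * e for a, e in zip(f_a, f_e)]
    return fused, weights


# Hypothetical 2-D features; a mouth Gaussian's query aligns with the audio axis.
f_a = [2.0, 0.0]
f_e = [0.0, 2.0]
mouth_fused, mouth_weights = fuse_conditions([1.0, 0.0], f_a, f_e)
neutral_fused, neutral_weights = fuse_conditions([0.0, 0.0], f_a, f_e)
```

With the query aligned to the audio feature, the audio weight exceeds the expression weight, while an all-zero query splits the weight evenly. In a full model, the fused feature would feed a network that outputs the Gaussian offsets.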