Toward Fine-Grained Facial Control in 3D Talking Head Generation
Shaoyang Xie, Xiaofeng Cong, Baosheng Yu, Zhipeng Gui, Jie Gui, Yuan Yan Tang, James Tin-Yau Kwok
TL;DR
This paper tackles fine-grained control in audio-driven 3D talking head generation, focusing on lip-sync accuracy and temporal stability. It introduces FG-3DGS, a framework that performs frequency-aware disentanglement: low-frequency facial regions are modeled with a shared MLP, high-frequency regions (eyes, mouth) are modeled with region-specific networks, and the predicted Gaussian deltas $\Delta G_r$ drive the motion. A high-frequency-refined post-rendering alignment step, guided by a lip-sync discriminator, further improves synchronization and per-frame accuracy. Experiments show that FG-3DGS achieves state-of-the-art fidelity and lip synchronization in both reconstruction and cross-subject tests, demonstrating the effectiveness of region-specific motion modeling and post-rendering alignment for realistic, real-time talking heads.
Abstract
Audio-driven talking head generation is a core component of digital avatars, and 3D Gaussian Splatting has shown strong performance in real-time rendering of high-fidelity talking heads. However, achieving precise control over fine-grained facial movements remains a significant challenge, particularly due to lip-synchronization inaccuracies and facial jitter, both of which can contribute to the uncanny valley effect. To address these challenges, we propose Fine-Grained 3D Gaussian Splatting (FG-3DGS), a novel framework that enables temporally consistent and high-fidelity talking head generation. Our method introduces a frequency-aware disentanglement strategy to explicitly model facial regions based on their motion characteristics. Low-frequency regions, such as the cheeks, nose, and forehead, are jointly modeled using a standard MLP, while high-frequency regions, including the eyes and mouth, are captured separately using a dedicated network guided by facial area masks. The predicted motion dynamics, represented as Gaussian deltas, are applied to the static Gaussians to generate the final head frames, which are rendered via a rasterizer using frame-specific camera parameters. Additionally, a high-frequency-refined post-rendering alignment mechanism, guided by a model pretrained on large-scale audio-video pairs, is incorporated to enhance per-frame generation and achieve more accurate lip synchronization. Extensive experiments on widely used datasets for talking head generation demonstrate that our method outperforms recent state-of-the-art approaches in producing high-fidelity, lip-synced talking head videos.
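To make the region-wise routing concrete, the following is a minimal NumPy sketch of the idea, not the authors' implementation. It assumes that each Gaussian carries an integer region label derived from a facial area mask (0 for low-frequency, 1 for eyes, 2 for mouth), that the deltas are position offsets only, and that a tiny randomly initialized two-layer MLP stands in for each network. The names `init_mlp`, `mlp`, `predict_deltas`, and `region_ids` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden, out_dim):
    """Random weights for a tiny two-layer MLP (illustrative stand-in only)."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def mlp(params, x):
    """Forward pass: linear -> ReLU -> linear."""
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return h @ params["w2"] + params["b2"]

def predict_deltas(positions, audio_feat, region_ids, shared, regional):
    """Predict one position offset per Gaussian, routed by region label.

    region_ids == 0 goes through the shared (low-frequency) network;
    every other label goes through its own network in `regional`.
    """
    n = positions.shape[0]
    # Condition every Gaussian on the same per-frame audio feature.
    audio = np.broadcast_to(audio_feat, (n, audio_feat.shape[0]))
    x = np.concatenate([positions, audio], axis=1)

    deltas = np.zeros_like(positions)
    low = region_ids == 0
    deltas[low] = mlp(shared, x[low])
    for rid, params in regional.items():
        mask = region_ids == rid
        deltas[mask] = mlp(params, x[mask])
    return deltas

# Toy setup: 6 static Gaussians, a 4-dimensional audio feature.
N, A = 6, 4
positions = rng.normal(size=(N, 3))          # static Gaussian centers
audio_feat = rng.normal(size=A)              # assumed per-frame audio embedding
region_ids = np.array([0, 0, 0, 1, 1, 2])    # assumed labels from area masks

shared = init_mlp(3 + A, 16, 3)
regional = {1: init_mlp(3 + A, 16, 3), 2: init_mlp(3 + A, 16, 3)}

deltas = predict_deltas(positions, audio_feat, region_ids, shared, regional)
deformed = positions + deltas                # deltas applied to static Gaussians
```

In this sketch `deformed` holds the per-frame Gaussian centers that would be handed to a rasterizer. Because each region is routed through its own network, changing the mouth network leaves the offsets of every other region untouched, which is the property the disentanglement is meant to provide.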
