NeRF-3DTalker: Neural Radiance Field with 3D Prior Aided Audio Disentanglement for Talking Head Synthesis
Xiaoxing Liu, Zhilei Liu, Chongke Bi
TL;DR
NeRF-3DTalker tackles frontal-view bias and audio-visual misalignment in NeRF-based talking head synthesis by integrating 3DMM-derived priors and a 3D Prior Aided Audio Disentanglement module that aligns the acoustic and visual spaces. The method combines four components: 3D Prior Extraction, 3D Prior Aided Audio Disentanglement, NeRF-based rendering, and a Local-Global Standardized Space with AU semantic and global codebooks. Together they produce $f_{id}$, $f_{exp-aud}$, $f_{exp-style}$, $f_{alb}$, and $f_{illu}$ as conditioning features. Experiments on four speakers show higher image quality and tighter lip synchronization than state-of-the-art NeRF-based and non-NeRF methods, with quantitative gains in LPIPS, AU Acc, and LMD-79 and qualitative improvements in multi-view realism. By disentangling audio semantics and using 3D priors to improve acoustic-visual alignment and rendering coherence, the work advances photorealistic, view-consistent talking head synthesis for applications that need realistic, free-view talking heads.
Abstract
Talking head synthesis aims to generate a lip-synchronized talking head video from audio. Recently, the ability of NeRF to enhance the realism and texture detail of synthesized talking heads has attracted considerable research attention. However, most current audio-driven NeRF methods are concerned exclusively with rendering frontal faces and cannot generate clear talking heads in novel views. Another prevalent challenge in current 3D talking head synthesis is the difficulty of aligning the acoustic and visual spaces, which often results in suboptimal lip synchronization. To address these issues, we propose Neural Radiance Field with 3D Prior Aided Audio Disentanglement for Talking Head Synthesis (NeRF-3DTalker). Specifically, the proposed method employs 3D prior information to synthesize clear talking heads from free viewpoints. Additionally, we propose a 3D Prior Aided Audio Disentanglement module, which disentangles the audio into two distinct categories: features related to 3D-aware speech movements and features related to speaking style. Moreover, to reposition generated frames that lie far from the speaker's motion space in real space, we devise a Local-Global Standardized Space, which normalizes the irregular positions in the generated frames from both global and local semantic perspectives. Comprehensive qualitative and quantitative experiments demonstrate that NeRF-3DTalker outperforms state-of-the-art methods in synthesizing realistic talking head videos, exhibiting superior image quality and lip synchronization. Project page: https://nerf-3dtalker.github.io/NeRF-3Dtalker.
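As a rough illustration of what normalizing against a codebook can mean in a standardized space, the sketch below snaps a feature vector to its nearest entry in a small codebook. This is a generic nearest-neighbor lookup written under our own assumptions, not the paper's implementation: the function name `snap_to_codebook`, the Euclidean distance, and the toy 2-D values are all hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def snap_to_codebook(feature, codebook):
    """Return (index, entry) of the codebook entry closest to `feature`.

    Hypothetical helper: a plain nearest-neighbor lookup, assumed here
    as a stand-in for codebook-based normalization.
    """
    index = min(range(len(codebook)), key=lambda i: dist(feature, codebook[i]))
    return index, codebook[index]


# Toy 2-D codebook with made-up values; real features would be
# higher-dimensional and the entries would be learned.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# A feature lying off the codebook is replaced by its closest entry.
idx, entry = snap_to_codebook((0.9, 0.2), codebook)
```

In this toy setting, a local (for example, per-AU) and a global codebook would simply be two such tables queried with different features; how the two are actually combined is not specified here.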
