LokiTalk: Learning Fine-Grained and Generalizable Correspondences to Enhance NeRF-based Talking Head Synthesis
Tianqi Li, Ruobing Zheng, Bonan Li, Zicheng Zhang, Meng Wang, Jingdong Chen, Ming Yang
TL;DR
LokiTalk tackles visual artifacts and high training costs in NeRF-based talking head synthesis by learning fine-grained, region-level correspondences between driving signals and portrait motion, and by using ID-aware knowledge transfer to share common dynamics across identities. The framework combines Region-Specific Deformation Fields (face and torso) with driving signals (Audio2Motion, eye-blink cues, and head pose) in a two-stage deformation pipeline to achieve more accurate lip sync, eye blinking, and torso continuity. An ID-Aware Knowledge Transfer module pretrains on multi-identity data to capture universal static and dynamic patterns, then injects ID-specific cues during fine-tuning via an ID-Encoder and hyper-networks, reducing data needs and training time. Quantitative and qualitative evaluations show that LokiTalk outperforms prior NeRF-based methods in fidelity and efficiency, and ablations confirm the value of region-specific modeling and identity-aware transfer for scalable, high-quality digital avatars.
Abstract
Despite significant progress in talking head synthesis since the introduction of Neural Radiance Fields (NeRF), visual artifacts and high training costs persist as major obstacles to large-scale commercial adoption. We argue that identifying and establishing fine-grained and generalizable correspondences between driving signals and generated results can address both problems at once. Here we present LokiTalk, a novel framework designed to enhance NeRF-based talking heads with lifelike facial dynamics and improved training efficiency. To achieve fine-grained correspondences, we introduce Region-Specific Deformation Fields, which decompose the overall portrait motion into lip movements, eye blinking, head pose, and torso movements. By hierarchically modeling the driving signals and their associated regions through two cascaded deformation fields, we significantly improve dynamic accuracy and reduce synthetic artifacts. Furthermore, we propose ID-Aware Knowledge Transfer, a plug-and-play module that learns generalizable dynamic and static correspondences from multi-identity videos while extracting ID-specific dynamic and static features to refine the depiction of individual characters. Comprehensive evaluations demonstrate that LokiTalk delivers higher-fidelity results and better training efficiency than previous methods. The code will be released upon acceptance.
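
To make the idea of two cascaded, region-specific deformation fields more concrete, the following NumPy sketch shows one possible reading of it. It is not the authors' implementation. The tiny random MLPs, the binary `face_mask`, the feature sizes, and the function names `make_mlp`, `deformation_field`, and `cascaded_deform` are all assumptions introduced here for illustration. The first stage offsets face points using a face driving signal (standing in for lip and eye-blink features), and the second stage offsets the remaining points using a head-pose signal.

```python
import numpy as np

def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Random weights for a two-layer MLP (a stand-in for a trained field)."""
    return {
        "w1": rng.normal(0.0, 0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.1, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def deformation_field(params, points, condition):
    """Predict a per-point 3D offset from positions and a driving signal."""
    # Repeat the conditioning vector for every point, then concatenate.
    cond = np.broadcast_to(condition, (points.shape[0], condition.shape[0]))
    x = np.concatenate([points, cond], axis=1)
    h = np.tanh(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]

def cascaded_deform(points, face_params, torso_params,
                    face_signal, pose_signal, face_mask):
    """Apply a face field, then a torso field, each restricted to its region."""
    # Stage 1 (assumed): face points move according to lip/eye features.
    face_offset = deformation_field(face_params, points, face_signal)
    stage1 = points + face_offset * face_mask[:, None]
    # Stage 2 (assumed): non-face points move according to head pose,
    # evaluated on the output of stage 1 so the two fields are cascaded.
    torso_offset = deformation_field(torso_params, stage1, pose_signal)
    return stage1 + torso_offset * (1.0 - face_mask)[:, None]

# Hypothetical usage: 5 sample points, 4-D face signal, 6-D pose signal.
rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))
face_params = make_mlp(3 + 4, 8, 3, rng)
torso_params = make_mlp(3 + 6, 8, 3, rng)
face_mask = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # first two points are "face"
deformed = cascaded_deform(points, face_params, torso_params,
                           np.ones(4), np.zeros(6), face_mask)
```

In this sketch each region responds only to its own driving signal, which is the property the region-specific decomposition is meant to provide. A real system would learn the fields jointly with the radiance field and would likely use soft or learned region assignments rather than a fixed mask.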
