LokiTalk: Learning Fine-Grained and Generalizable Correspondences to Enhance NeRF-based Talking Head Synthesis

Tianqi Li, Ruobing Zheng, Bonan Li, Zicheng Zhang, Meng Wang, Jingdong Chen, Ming Yang

TL;DR

LokiTalk tackles visual artifacts and high training costs in NeRF-based talking head synthesis in two ways: it learns fine-grained, region-level correspondences between driving signals and portrait motion, and it uses ID-aware knowledge transfer to share common dynamics across identities. The framework combines Region-Specific Deformation Fields (face and torso) with driving signals (Audio2Motion, eye-blink cues, and head pose) in a two-stage deformation pipeline to achieve more accurate lip sync, eye behavior, and torso continuity. An ID-Aware Knowledge Transfer module pretrains on multi-identity data to capture universal static and dynamic patterns, then injects ID-specific cues during fine-tuning via an ID-Encoder and hyper-networks, reducing data needs and training time. Quantitative and qualitative evaluations show that LokiTalk outperforms prior NeRF-based methods in fidelity and efficiency, and ablations confirm the value of region-specific modeling and identity-aware transfer for scalable, high-quality digital avatars.

Abstract

Despite significant progress in talking head synthesis since the introduction of Neural Radiance Fields (NeRF), visual artifacts and high training costs persist as major obstacles to large-scale commercial adoption. We propose that identifying and establishing fine-grained and generalizable correspondences between driving signals and generated results can simultaneously resolve both problems. Here we present LokiTalk, a novel framework designed to enhance NeRF-based talking heads with lifelike facial dynamics and improved training efficiency. To achieve fine-grained correspondences, we introduce Region-Specific Deformation Fields, which decompose the overall portrait motion into lip movements, eye blinking, head pose, and torso movements. By hierarchically modeling the driving signals and their associated regions through two cascaded deformation fields, we significantly improve dynamic accuracy and minimize synthetic artifacts. Furthermore, we propose ID-Aware Knowledge Transfer, a plug-and-play module that learns generalizable dynamic and static correspondences from multi-identity videos, while simultaneously extracting ID-specific dynamic and static features to refine the depiction of individual characters. Comprehensive evaluations demonstrate that LokiTalk delivers superior high-fidelity results and training efficiency compared to previous methods. The code will be released upon acceptance.
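
To make the idea of two cascaded deformation fields more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the first stage predicts a face offset from sample points and a face-related driving feature, and that the second stage predicts a torso offset from the already-warped points and a torso-related feature. The function names (`make_mlp`, `mlp`, `cascaded_deform`), the feature dimensions, and the random untrained weights are all hypothetical and chosen only for illustration.

```python
import numpy as np


def make_mlp(rng, in_dim, hidden, out_dim):
    """Create random (untrained) weights for a two-layer MLP. Hypothetical helper."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }


def mlp(params, x):
    """Apply the two-layer MLP with a ReLU in between."""
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return h @ params["w2"] + params["b2"]


def cascaded_deform(points, face_signal, torso_signal, face_net, torso_net):
    """Two-stage deformation (assumed layout, see the lead-in).

    points:       (N, 3) sample positions
    face_signal:  (Df,) driving feature for the face stage (e.g. audio, eye ratio)
    torso_signal: (Dt,) driving feature for the torso stage (e.g. head pose)
    Returns the face offsets, the torso offsets, and the final warped points.
    """
    n = points.shape[0]

    # Stage 1: the face field sees the raw points plus the face driving feature.
    face_in = np.concatenate([points, np.tile(face_signal, (n, 1))], axis=1)
    dx_face = mlp(face_net, face_in)
    x_face = points + dx_face

    # Stage 2: the torso field sees the points already warped by stage 1.
    torso_in = np.concatenate([x_face, np.tile(torso_signal, (n, 1))], axis=1)
    dx_torso = mlp(torso_net, torso_in)

    return dx_face, dx_torso, x_face + dx_torso


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    face_net = make_mlp(rng, 3 + 8, 16, 3)   # 8-dim face feature is an assumption
    torso_net = make_mlp(rng, 3 + 6, 16, 3)  # 6-dim pose feature is an assumption
    pts = rng.uniform(-1.0, 1.0, (5, 3))
    dx_f, dx_t, warped = cascaded_deform(
        pts, rng.normal(size=8), rng.normal(size=6), face_net, torso_net
    )
    print(warped.shape)
```

Feeding the stage-1 output into stage 2 is one plausible reading of "cascaded"; other designs could condition the second field on the first in a different way.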

Paper Structure

This paper contains 23 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of the proposed Region-Specific Deformation Fields. The driving signals (audio, pose, eye ratio) participate in the two-stage prediction of the face and torso deformation fields, respectively. The mask shown after each driving signal represents the cross-attention loss between that driving signal and its corresponding region. A colored cubic grid illustrates the predicted deformation fields, with the internal heat maps indicating the deformation magnitude.
  • Figure 2: ID-Aware Knowledge Transfer. The blue modules are the common correspondences shared among multiple identities, comprising dynamic (light blue) and static (dark blue) correspondences. The colored modules are the dynamic (facial actions) and static (geometry and appearance) information of individual identities. During pre-training (entire yellow panel), both the upper and lower parts are trained simultaneously on multi-ID data, allowing the model to learn universal information while extracting individual information. During fine-tuning, the lower half continues training from the ID-aware initialization parameters obtained from the ID-Encoder.
  • Figure 3: Comparison of keyframes and details of the generated portraits. Red arrows mark out-of-sync or poorly rendered results, where the generated eyes, mouths, broken necks, or wrinkles clearly deviate from the real ones. We also show details of the eyes, teeth, forehead wrinkles, and mouth area. Please zoom in for better visualization.
  • Figure 4: Comparison of the depth maps generated by our method and the baseline methods. Our depth map shows more details on the face area, especially the mouth and eye expressions (differences between open and closed eyes). The connection between the face and torso is more consistent in our results.
  • Figure 5: Heatmaps of $\sum \left\| \Delta \mathbf{x}_{face} \right\|$ and $\sum \left\| \Delta \mathbf{x}_{torso} \right\|$. Brighter areas represent regions with larger deformations. The bright area near the hair edges is caused by jitter in the parsing results, which misleads the learning of the deformations.
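
The heatmaps in Figure 5 sum the norms of the predicted offsets. A minimal sketch of how such a map could be computed is shown below. It assumes the offsets are stored as an array of shape `(S, H, W, 3)`, where `S` indexes whatever is being summed over (for example samples along a ray or video frames; the caption does not specify which). The function name `deformation_heatmap` and the max-normalization for display are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def deformation_heatmap(offsets):
    """Sum per-pixel offset norms and scale to [0, 1] for display.

    offsets: (S, H, W, 3) array of 3D deformation offsets (assumed layout).
    Returns an (H, W) array where brighter (larger) values mean more deformation.
    """
    norms = np.linalg.norm(offsets, axis=-1)  # (S, H, W): Euclidean norm of each offset
    total = norms.sum(axis=0)                 # (H, W): sum over the S axis
    peak = total.max()
    # Avoid division by zero when no deformation is present at all.
    return total / peak if peak > 0 else total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.normal(0.0, 0.01, (4, 8, 8, 3))  # synthetic offsets, illustration only
    print(deformation_heatmap(demo).shape)
```

The same function can be applied separately to the face offsets and the torso offsets to obtain one map per region.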