RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network
Xiaozhong Ji, Chuming Lin, Zhonggan Ding, Ying Tai, Junwei Zhu, Xiaobin Hu, Donghao Luo, Yanhao Ge, Chengjie Wang
TL;DR
RealTalk addresses real-time, identity-preserving audio-driven talking-face generation with a two-stage framework: an audio-to-expression transformer that draws on enriched 3D facial priors and historical expressions, and an efficient expression-to-face renderer equipped with a learnable mask and a Facial Identity Alignment (FIA) module. The architecture applies cross-modal attention to 3D shape and expression priors, injects the predicted 3D coefficients through AdaIN, and transfers texture from a single reference frame, reaching high fidelity at 30 FPS. The key contributions are the Improved Facial Prior with cross-modal attention, the Learnable Mask that bridges audio-to-face generation and the identity reference, and the FIA module, which enables precise lip control and efficient single-frame texture transfer. Experiments on VoxCeleb1, MEAD, and HDTF show better lip-speech synchronization, better image-quality scores (lower FID and LPIPS, higher SSIM), and real-time performance compared with state-of-the-art methods; ablation studies validate the contribution of the priors, the mask, and the FIA design. The work advances practical real-time digital-human applications, and it also acknowledges the potential for misuse, suggesting safeguards and future research to mitigate the risks.
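To make the AdaIN-based injection mentioned above concrete, here is a minimal NumPy sketch of adaptive instance normalization in general. It is not the paper's implementation: the linear mapping `W`, the coefficient dimension (64), the feature-map size, and the `1 + ...` scale parameterization are all assumptions chosen for illustration.

```python
import numpy as np

def adain(features, scale, shift, eps=1e-5):
    """Adaptive instance normalization for one feature map.

    features: array of shape (C, H, W)
    scale, shift: arrays of shape (C,), derived from a conditioning vector
    """
    # Normalize each channel over its spatial positions.
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)
    # Re-scale and re-shift each channel with the conditioning statistics.
    return scale[:, None, None] * normalized + shift[:, None, None]

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8, 8))   # hypothetical renderer feature map
coeffs = rng.normal(size=(64,))         # stand-in for predicted 3D coefficients
W = rng.normal(size=(8, 64)) * 0.1      # hypothetical mapping to 2*C parameters
params = W @ coeffs
scale, shift = 1.0 + params[:4], params[4:]

out = adain(features, scale, shift)
```

After this operation, each output channel has mean `shift[c]` and standard deviation close to `|scale[c]|`, which is how a conditioning vector can steer feature statistics without any spatial alignment step.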
Abstract
Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have achieved remarkable progress in audio-visual synchronization, but there is still a significant gap between current results and practical applications. The challenges are two-fold: 1) preserving unique individual traits to achieve high-precision lip synchronization, and 2) generating high-quality facial renderings in real time. In this paper, we propose a novel generalized audio-driven framework, RealTalk, which consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. By incorporating cross-modal attention on the enriched facial priors, we can effectively align lip movements with audio and thus predict expressions with greater precision. In the second component, we design a lightweight facial identity alignment (FIA) module, which includes a lip-shape control structure and a face texture reference structure. This novel design allows us to generate fine details in real time, without depending on sophisticated and inefficient feature alignment modules. Our experimental results, both quantitative and qualitative, on public datasets demonstrate the clear advantages of our method in terms of lip-speech synchronization and generation quality. Furthermore, our method is efficient and requires fewer computational resources, making it well-suited to the needs of practical applications.
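As an illustration of the cross-modal attention idea described in the abstract, the following NumPy sketch shows generic scaled dot-product cross-attention. It is a simplification under stated assumptions, not the paper's architecture: it assumes audio tokens act as queries and facial-prior tokens act as keys and values, it omits the learned query/key/value projections, and the token counts and feature size are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention from one modality to another.

    queries: (Nq, d), keys: (Nk, d), values: (Nk, dv)
    Returns the attended features (Nq, dv) and the attention weights (Nq, Nk).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
audio_tokens = rng.normal(size=(5, 16))  # hypothetical audio features (queries)
prior_tokens = rng.normal(size=(3, 16))  # hypothetical facial-prior features

fused, weights = cross_attention(audio_tokens, prior_tokens, prior_tokens)
```

Each row of `weights` sums to 1, so every audio token receives a convex combination of the prior tokens.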
