RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

Xiaozhong Ji, Chuming Lin, Zhonggan Ding, Ying Tai, Junwei Zhu, Xiaobin Hu, Donghao Luo, Yanhao Ge, Chengjie Wang

TL;DR

RealTalk tackles real-time, identity-preserving audio-driven talking-face generation with a two-stage framework: an audio-to-expression transformer that leverages enriched 3D facial priors and historical expressions, and an efficient expression-to-face renderer equipped with a learnable mask and a Facial Identity Alignment (FIA) module. The architecture uses cross-modal attention on 3D shape and expression priors, AdaIN-based injection of predicted 3D coefficients, and single-frame texture transfer to achieve high fidelity at 30 FPS. Key contributions include the Improved Facial Prior with cross-modal attention, the Learnable Mask bridging audio-to-face and identity reference, and the FIA module, which enables precise lip control with efficient texture transfer from one frame. Experiments on VoxCeleb1, MEAD, and HDTF report better lip-speech synchronization, better FID, LPIPS, and SSIM scores, and real-time performance compared with state-of-the-art methods, while ablation studies validate the impact of the priors, the masking, and the FIA design. The work advances practical real-time digital-human applications but also highlights potential misuse, suggesting safeguards and future research to mitigate risks.

Abstract

Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have achieved remarkable progress in audio-visual synchronization, but there is still a significant gap between current results and practical applications. The challenges are two-fold: 1) preserving unique individual traits to achieve high-precision lip synchronization; 2) generating high-quality facial renderings in real time. In this paper, we propose a novel generalized audio-driven framework, RealTalk, which consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. By incorporating cross-modal attention on the enriched facial priors, we can effectively align lip movements with audio, thus attaining greater precision in expression prediction. In the second component, we design a lightweight facial identity alignment (FIA) module, which includes a lip-shape control structure and a face texture reference structure. This novel design allows us to generate fine details in real time, without depending on sophisticated and inefficient feature alignment modules. Our experimental results, both quantitative and qualitative, on public datasets demonstrate the clear advantages of our method in terms of lip-speech synchronization and generation quality. Furthermore, our method is efficient and requires fewer computational resources, making it well-suited to meet the needs of practical applications.
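
The cross-modal attention mentioned in the abstract can be illustrated with a minimal sketch of single-head scaled dot-product attention. This is not the authors' implementation: treating audio frames as queries and facial-prior tokens as keys and values is an assumption made here for illustration, the learned projections of a real transformer are omitted, and the function names and toy numbers are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention, no learned projections.

    queries: list of d-dim vectors (assumed here to be audio frames).
    keys, values: lists of vectors (assumed here to be facial-prior tokens).
    Returns one attended vector per query and the attention weights.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy data (made up): 2 audio frames attend over 3 prior tokens.
audio = [[1.0, 0.0], [0.0, 1.0]]
prior = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn = cross_attention(audio, prior, prior)
```

Each row of `attn` sums to one, so every audio frame receives a convex combination of the prior tokens; a full model would add learned query, key, and value projections and multiple heads.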

Paper Structure (10 sections, 10 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Left: Visual comparison of lip sync and generation quality with IP-LAP (Zhong et al., 2023) and DINet (Zhang et al., 2023). Our method achieves precise lip-synced talking faces, closer to the target lip, with higher visual quality. Right: Speed, LMD, and FID comparisons. Our method generates talking faces at 30 FPS on an NVIDIA V100, showcasing the best LMD and FID scores while maintaining real-time speed.
  • Figure 2: Framework of our approach. Our network is divided into two parts: the Audio-to-expression Transformer and the Expression-to-face Renderer. Preprocessing extracts 3D shapes $\alpha_{1:N}$, expressions $\beta_{1:N}$, poses $\rho_{1:N}$, and audio features $w_{1:l}$. In the first part, the shape $\alpha_{1}$ and historical expressions $\beta_{1:N}$ are used as the Improved Facial Prior to predict $\hat{\beta}_{1:N}$ while preserving identity and intra-personal lip amplitude variations. In the second part, the predicted expressions are injected into the proposed Facial Identity Alignment (FIA) module to inpaint the masked source frame $I_s^m$ with the target lip through cross-attention with the identity reference $I_r$.
  • Figure 3: Architecture of the Audio-to-expression Transformer. (Left) The audio, shape, and historical expressions are processed through an encoder to obtain memory features, which are then combined with expression queries in the decoder to generate predictions. (Right) The predicted expressions and ground-truth (GT) expressions are optimized using reconstruction and vertex losses.
  • Figure 4: Illustration of the learnable mask based on predicted facial expressions. 1) The estimated 3D vertices allow us to select points with fixed positions relative to the face. We choose points that emphasize a larger facial contour to accommodate diverse lip movements. 2) Comparison between our learnable mask and the naive mask (IP-LAP); the top-left inset in our result shows the target lip. Our method effectively adjusts the face shape based on the spoken content, while IP-LAP yields unnatural results, e.g., a double chin.
  • Figure 5: Architecture of the Facial Identity Alignment module. The inputs to the FIA module include the predicted facial expression coefficients $\hat{\beta}_t$, the known shape $\alpha_t$ and pose $\rho_t$, the feature from the previous module $\bar{F}_{i-1}$, and the identity reference image feature $F_{d-i}^r$. The 3D coefficients are injected through AdaIN, enabling control over the lip. Multi-scale reference features interact with the current features through cross-attention, facilitating effective texture transfer.
  • ...and 4 more figures
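
The Figure 5 caption states that the 3D coefficients are injected through AdaIN. The sketch below shows the generic AdaIN operation under stated assumptions; it is not the paper's FIA module. The linear map from coefficients to per-channel gain and offset, the names `linear`, `adain`, `gain`, and `offset`, and all numbers are hypothetical. Each channel is normalized to zero mean and unit variance over its spatial positions and then rescaled with statistics derived from the coefficients.

```python
import math

def linear(vec, weight, bias):
    """Plain affine map: one output per (row of weight, bias entry)."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def adain(feature_map, gain, offset, eps=1e-5):
    """Adaptive instance normalization for a single sample.

    feature_map: list of channels, each a flat list of spatial activations.
    gain, offset: one value per channel, derived here from 3D coefficients.
    """
    out = []
    for channel, g, o in zip(feature_map, gain, offset):
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        std = math.sqrt(var + eps)
        # Normalize the channel, then apply the coefficient-driven style.
        out.append([g * (x - mean) / std + o for x in channel])
    return out

# Toy setup (all values made up): 2 coefficients control 2 feature channels.
coeffs = [0.5, -1.0]                      # stand-in for expression/shape/pose
weight = [[1.0, 0.0], [0.0, 1.0],         # rows 0-1 produce the gains
          [0.5, 0.5], [0.0, 0.0]]         # rows 2-3 produce the offsets
bias = [1.0, 2.0, 0.0, 0.3]
style = linear(coeffs, weight, bias)
gain, offset = style[:2], style[2:]

features = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]
styled = adain(features, gain, offset)
```

After the call, each output channel has a mean equal to its offset and a standard deviation close to the magnitude of its gain, which is how the coefficients steer the feature statistics without any spatial alignment step.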