Dual Audio-Centric Modality Coupling for Talking Head Generation
Ao Fu, Ziqi Ni, Yi Zhou
TL;DR
The paper targets realistic audio-driven talking head video generation, combining semantic audio content with precise lip synchronization in a NeRF-based framework. It introduces Dual Audio-Centric Modality Coupling (DAMC), which uses two encoders: a Content-Aware Encoder $E_c$ for linguistic content and a Dynamic-Sync Encoder $E_d$ for visual synchronization. Their outputs are fused by the Cross-Synchronized Fusion Module (CSFM) into a unified embedding $f_a$. The method also incorporates an inference-enhanced TTS module and extends ER-NeRF with tri-plane hash-based rendering, using a two-stage optimization that transitions from an MSE loss to a perceptual LPIPS loss for high-quality geometry and texture. Empirical results show that DAMC outperforms state-of-the-art methods in lip synchronization and image quality, generalizes well to synthetic TTS-generated speech, and handles Mandarin robustly via vocabulary simplification. Overall, DAMC offers a scalable, high-fidelity approach to audio-driven facial animation, with potential applications in virtual avatars and digital media.
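The two-stage optimization mentioned above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the `switch_step` boundary, the `perceptual_weight` value, and the choice to add the perceptual term on top of MSE in the second stage. The summary only states that training transitions from MSE to LPIPS; it does not give the exact combination or weights. The perceptual function is passed in as a callable so that a real LPIPS implementation could be substituted.

```python
from typing import Callable, Sequence

Pixels = Sequence[float]


def mse(pred: Pixels, target: Pixels) -> float:
    """Mean squared error over flattened pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def two_stage_loss(
    pred: Pixels,
    target: Pixels,
    step: int,
    switch_step: int,
    perceptual_fn: Callable[[Pixels, Pixels], float],
    perceptual_weight: float = 0.1,  # hypothetical weight, not from the paper
) -> float:
    """Stage 1 (step < switch_step): pixel-wise MSE only.

    Stage 2: MSE plus a weighted perceptual term (assumed combination).
    """
    pixel_term = mse(pred, target)
    if step < switch_step:
        return pixel_term
    return pixel_term + perceptual_weight * perceptual_fn(pred, target)
```

In practice `perceptual_fn` would wrap a learned perceptual metric such as LPIPS evaluated on rendered image patches; the scalar stand-in keeps the sketch dependency-free.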
Abstract
The generation of audio-driven talking head videos is a key challenge in computer vision and graphics, with applications in virtual avatars and digital media. Traditional approaches often struggle to capture the complex interaction between audio and facial dynamics, leading to poor lip synchronization and degraded visual quality. In this paper, we propose a novel NeRF-based framework, Dual Audio-Centric Modality Coupling (DAMC), which effectively integrates content and dynamic features from audio inputs. By leveraging a dual encoder structure, DAMC captures semantic content through the Content-Aware Encoder and ensures precise visual synchronization through the Dynamic-Sync Encoder. These features are fused using a Cross-Synchronized Fusion Module (CSFM), enhancing content representation and lip synchronization. Extensive experiments show that our method outperforms existing state-of-the-art approaches in key metrics such as lip synchronization accuracy and image quality, and that it generalizes robustly across various audio inputs, including synthetic speech from text-to-speech (TTS) systems. Our results provide a promising, scalable solution for high-quality, audio-driven talking head generation.
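To make the dual-encoder fusion concrete, the following is a minimal sketch of one plausible way two per-frame feature streams could be coupled. It is not the authors' CSFM: the abstract does not specify the module's internals, so the sketch assumes bidirectional scaled dot-product cross-attention followed by a residual add and per-frame concatenation. The names `cross_attend` and `fuse` are hypothetical, and both streams are assumed to have the same number of frames and the same feature dimension.

```python
import math
from typing import List

Frames = List[List[float]]  # one feature vector per audio frame


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries: Frames, keys_values: Frames) -> Frames:
    """Each query frame takes an attention-weighted average of the other stream."""
    scale = math.sqrt(len(queries[0]))
    dim = len(keys_values[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        attended.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def fuse(content_feats: Frames, dynamic_feats: Frames) -> Frames:
    """Hypothetical stand-in for CSFM (assumed design, not from the paper).

    content_feats plays the role of E_c outputs, dynamic_feats of E_d outputs.
    Each stream attends to the other, the result is added back residually,
    and the two updated streams are concatenated per frame to form f_a.
    """
    content_ctx = cross_attend(content_feats, dynamic_feats)
    dynamic_ctx = cross_attend(dynamic_feats, content_feats)
    fused = []
    for c, d, cc, dc in zip(content_feats, dynamic_feats, content_ctx, dynamic_ctx):
        updated_content = [x + y for x, y in zip(c, cc)]
        updated_dynamic = [x + y for x, y in zip(d, dc)]
        fused.append(updated_content + updated_dynamic)
    return fused
```

With three frames of 2-dimensional features per stream, `fuse` returns three 4-dimensional vectors, one fused embedding per frame. A learned version would add projection matrices for queries, keys, and values, which are omitted here for brevity.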
