Dual Audio-Centric Modality Coupling for Talking Head Generation

Ao Fu, Ziqi Ni, Yi Zhou

TL;DR

The paper addresses the challenge of generating realistic audio-driven talking head videos by bridging semantic audio content with precise lip synchronization within a NeRF-based framework. It introduces Dual Audio-Centric Modality Coupling (DAMC), which uses two encoders, a Content-Aware Encoder $E_c$ for linguistic content and a Dynamic-Sync Encoder $E_d$ for visual synchronization, whose outputs are fused by the Cross-Synchronized Fusion Module (CSFM) into a unified embedding $f_a$. The method incorporates an inference-enhanced TTS module and extends ER-NeRF with tri-plane hash-based rendering, employing a two-stage optimization that transitions from MSE to perceptual LPIPS losses for high-quality geometry and texture. Empirical results show that DAMC outperforms state-of-the-art methods in lip synchronization and image quality, generalizes well to synthetic TTS-generated speech, and handles Mandarin robustly via vocabulary simplification. Overall, DAMC offers a scalable, high-fidelity approach to audio-driven facial animation, with potential applications in virtual avatars and digital media.

Abstract

The generation of audio-driven talking head videos is a key challenge in computer vision and graphics, with applications in virtual avatars and digital media. Traditional approaches often struggle with capturing the complex interaction between audio and facial dynamics, leading to lip synchronization and visual quality issues. In this paper, we propose a novel NeRF-based framework, Dual Audio-Centric Modality Coupling (DAMC), which effectively integrates content and dynamic features from audio inputs. By leveraging a dual encoder structure, DAMC captures semantic content through the Content-Aware Encoder and ensures precise visual synchronization through the Dynamic-Sync Encoder. These features are fused using a Cross-Synchronized Fusion Module (CSFM), enhancing content representation and lip synchronization. Extensive experiments show that our method outperforms existing state-of-the-art approaches in key metrics such as lip synchronization accuracy and image quality, demonstrating robust generalization across various audio inputs, including synthetic speech from text-to-speech (TTS) systems. Our results provide a promising solution for high-quality, audio-driven talking head generation and present a scalable approach for creating realistic talking heads.

Paper Structure

This paper contains 17 sections, 10 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the proposed framework. The framework consists of two modality feature extractors: the Content-Aware Encoder ($E_c$) and the Dynamic-Sync Encoder ($E_d$), which extract content and dynamic features from the audio, respectively. The extracted features, $f_c$ and $f_d$, are fused using the Cross-Synchronized Fusion Module (CSFM).
  • Figure 2: Structure of the Cross-Synchronized Fusion Module. The module processes the inputs $f_c$ and $f_d$ through the Projection Block, Content-Dynamic Cross Attention, and Feature Self-Refinement layers to generate the final output $f_a$.
  • Figure 3: Qualitative comparison of our proposed method with various state-of-the-art approaches. The results demonstrate the superiority of our method in terms of visual quality, lip synchronization accuracy, and facial identity preservation.
  • Figure 4: TTS-driven video generation. Our method maintains good lip shapes and image quality even when driven by TTS-generated speech, both when producing various phonemes and when remaining silent.
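
The three CSFM stages named in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `csfm_sketch`, the random projection weights, the shared width `dim`, the direction of the cross attention (content features as queries, dynamic features as keys and values), and the residual self-attention used for refinement are all assumptions made for illustration. Only the stage names and the symbols $f_c$, $f_d$, and $f_a$ come from the source.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (assumption: no trained parameters here)."""
    std = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

def csfm_sketch(f_c, f_d, dim=8, seed=0):
    """Hypothetical fusion of content features f_c and dynamic features f_d."""
    rng = random.Random(seed)
    # Projection Block: map both feature streams to a shared width `dim`.
    p_c = matmul(f_c, random_matrix(len(f_c[0]), dim, rng))
    p_d = matmul(f_d, random_matrix(len(f_d[0]), dim, rng))
    # Content-Dynamic Cross Attention (assumed direction: content queries
    # attend over dynamic keys/values), with a residual connection.
    fused = add(p_c, attention(p_c, p_d, p_d))
    # Feature Self-Refinement (assumed form: residual self-attention).
    return add(fused, attention(fused, fused, fused))

if __name__ == "__main__":
    rng = random.Random(1)
    f_c = [[rng.random() for _ in range(6)] for _ in range(4)]  # 4 frames, 6-dim content
    f_d = [[rng.random() for _ in range(5)] for _ in range(4)]  # 4 frames, 5-dim dynamic
    f_a = csfm_sketch(f_c, f_d)
    print(len(f_a), len(f_a[0]))  # 4 8
```

In this sketch the output $f_a$ keeps the temporal length of the content stream and the shared width `dim`, which is one plausible way for a single embedding to condition a downstream renderer; the paper's actual layer sizes and attention layout may differ.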