DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model

Kangwei Liu, Junwu Liu, Yun Cao, Jinlin Guo, Xiaowei Yi

TL;DR

DisentTalk tackles cross-lingual talking-face generation by decoupling 3DMM expression parameters into semantic subspaces and integrating them into a spatio-temporal diffusion framework. By disentangling lip, eye, and global dynamics, the method achieves fine-grained regional control while maintaining temporal coherence, and is trained with a lip-sync objective on HuBERT-based audio features. The CHDTF dataset enables robust cross-lingual evaluation, and extensive experiments show state-of-the-art lip synchronization, expression naturalness, and temporal stability, with real-time performance. This approach advances cross-lingual facial animation by bridging geometric parameter control with diffusion-based generation and region-aware temporal modeling.

Abstract

Recent advances in talking face generation have significantly improved facial animation synthesis. However, existing approaches face fundamental limitations: 3DMM-based methods maintain temporal consistency but lack fine-grained regional control, while Stable Diffusion-based methods enable spatial manipulation but suffer from temporal inconsistencies. The integration of these approaches is hindered by incompatible control mechanisms and semantic entanglement of facial representations. This paper presents DisentTalk, introducing a data-driven semantic disentanglement framework that decomposes 3DMM expression parameters into meaningful subspaces for fine-grained facial control. Building upon this disentangled representation, we develop a hierarchical latent diffusion architecture that operates in 3DMM parameter space, integrating region-aware attention mechanisms to ensure both spatial precision and temporal coherence. To address the scarcity of high-quality Chinese training data, we introduce CHDTF, a Chinese high-definition talking face dataset. Extensive experiments show superior performance over existing methods across multiple metrics, including lip synchronization, expression quality, and temporal consistency. Project Page: https://kangweiiliu.github.io/DisentTalk.
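To make the idea of decomposing expression parameters into subspaces concrete, here is a minimal sketch, not the authors' implementation. It assumes a 64-dimensional expression vector per frame and uses hypothetical, hand-picked index sets for the lip, eye, and global subspaces. The paper derives its subspaces from data, so the dimension and index assignments below are placeholders only. The sketch shows what a partition buys you once the index sets are known: one region can be replaced while the others stay untouched.

```python
import numpy as np

# Assumed dimensionality and hypothetical index sets. The paper obtains its
# subspaces through data-driven analysis; these values are illustrative only.
EXP_DIM = 64
SUBSPACES = {
    "lip": np.arange(0, 20),
    "eye": np.arange(20, 30),
    "global": np.arange(30, 64),
}

def split_expression(exp):
    """Split (T, EXP_DIM) expression coefficients into per-region arrays."""
    return {name: exp[:, idx] for name, idx in SUBSPACES.items()}

def merge_expression(parts):
    """Reassemble per-region arrays into a single (T, EXP_DIM) array."""
    n_frames = next(iter(parts.values())).shape[0]
    exp = np.zeros((n_frames, EXP_DIM), dtype=float)
    for name, idx in SUBSPACES.items():
        exp[:, idx] = parts[name]
    return exp

def swap_region(target, source, region):
    """Copy one region's coefficients from source into target; keep the rest."""
    parts = split_expression(target)
    parts[region] = split_expression(source)[region]
    return merge_expression(parts)
```

For example, `swap_region(a, b, "lip")` returns a sequence whose lip coefficients come from `b` and whose eye and global coefficients come from `a`, which is the kind of regional control a disentangled representation is meant to allow.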

Paper Structure

This paper contains 10 sections, 10 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Comparison of talking face generation approaches. (A) 3DMM-based methods are temporally consistent but lack regional control. (B) Stable Diffusion-based methods enable spatial control but are temporally inconsistent. (C) Our DisentTalk achieves both through the disentangled latent diffusion model.
  • Figure 2: Overview of our spatio-temporal aware diffusion framework. The model leverages disentangled 3DMM parameters and hierarchical attention mechanisms to achieve precise control over distinct facial regions while maintaining temporal coherence.
  • Figure 3: 3DMM parameter disentanglement process. We decompose expression parameters into lip articulation, eye dynamics, and global expression subspaces through data-driven analysis of facial modifications.
  • Figure 4: Qualitative comparison with state-of-the-art methods. Our approach demonstrates superior performance in lip synchronization, natural facial expressions, and temporal consistency.
  • Figure 5: Eye aspect ratio analysis over time. Our method generates physiologically plausible blinking patterns compared to baselines.
  • ...and 1 more figure
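The eye aspect ratio (EAR) mentioned in the Figure 5 caption can be illustrated with a short sketch. The page does not say how the authors compute it, so this assumes the commonly used six-landmark definition, EAR = (|p2 - p6| + |p3 - p5|) / (2 |p1 - p4|), where p1 and p4 are the horizontal eye corners. The blink threshold and minimum run length are illustrative assumptions, not values from the paper.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """Compute EAR from a (6, 2) array of eye landmarks ordered p1..p6.

    p1 and p4 are the horizontal corners; (p2, p6) and (p3, p5) are the
    two vertical landmark pairs.
    """
    eye = np.asarray(eye, dtype=float)
    v1 = np.linalg.norm(eye[1] - eye[5])  # |p2 - p6|
    v2 = np.linalg.norm(eye[2] - eye[4])  # |p3 - p5|
    h = np.linalg.norm(eye[0] - eye[3])   # |p1 - p4|
    return (v1 + v2) / (2.0 * h)

def count_blinks(ear_series, threshold=0.2, min_frames=2):
    """Count runs of at least min_frames consecutive frames below threshold.

    threshold and min_frames are assumed values chosen for illustration.
    """
    blinks, run = 0, 0
    for ear in ear_series:
        if ear < threshold:
            run += 1
        else:
            if run >= min_frames:
                blinks += 1
            run = 0
    if run >= min_frames:  # a blink still in progress at the end of the clip
        blinks += 1
    return blinks
```

Plotting the per-frame EAR over time and counting its dips gives a simple way to compare blinking behaviour between generated and real videos.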