EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis

Shuai Tan, Bin Ji, Mengxiao Bi, Ye Pan

TL;DR

EDTalk tackles the challenge of fine-grained, disentangled control over mouth shape, head pose, and emotional expression in talking head synthesis, enabling both video- and audio-driven generation. It introduces three component-aware latent spaces with orthogonal base banks and a two-stage training strategy to achieve complete decoupling, plus an Audio-to-Motion module that maps audio (and semantic cues) to bank weights for lip, pose, and expression synthesis. The approach supports one-shot video-driven generation and reconstructs audio-driven motion through probabilistic pose modeling and semantically-aware expression generation, using lightweight modules to enable efficient training. Experiments on MEAD and HDTF demonstrate state-of-the-art performance across quality, lip-sync accuracy, and emotional realism, with notable efficiency advantages over prior disentanglement methods. The work advances practical, multimodal talking head generation while also addressing limitations and ethical considerations.

Abstract

Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the applicability and entertainment value of talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved and shared across inputs of different modalities, two aspects often neglected in existing methods. To address this gap, this paper proposes a novel Efficient Disentanglement framework for Talking head generation (EDTalk). Our framework enables individual manipulation of mouth shape, head pose, and emotional expression, conditioned on video or audio inputs. Specifically, we employ three lightweight modules to decompose the facial dynamics into three distinct latent spaces representing mouth, pose, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions. To ensure independence and accelerate training, we enforce orthogonality among the bases and devise an efficient training strategy that allocates motion responsibilities to each space without relying on external knowledge. The learned bases are then stored in corresponding banks, so that the visual priors can be shared with audio input. Furthermore, considering the properties of each space, we propose an Audio-to-Motion module for audio-driven talking head synthesis. Experiments are conducted to demonstrate the effectiveness of EDTalk. We recommend visiting the project website: https://tanshuai0219.github.io/EDTalk/
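The abstract's idea of a latent space "characterized by a set of learnable bases whose linear combinations define specific motions", with orthogonality enforced among the bases, can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the dimensions, the QR-based initialization, and the Gram-matrix penalty are all hypothetical choices made for the example.

```python
import numpy as np

def make_orthonormal_bank(num_bases, dim, seed=0):
    """Hypothetical helper: build a bank whose rows are orthonormal bases.

    A trained model would learn these bases; here QR decomposition of a
    random matrix simply stands in for a bank that is already orthogonal.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, num_bases)))
    return q.T  # shape (num_bases, dim)

def orthogonality_penalty(bank):
    """Assumed penalty: squared Frobenius distance of the Gram matrix from I."""
    gram = bank @ bank.T
    return float(np.sum((gram - np.eye(bank.shape[0])) ** 2))

def combine(weights, bank):
    """A motion latent as a linear combination of the bases in the bank."""
    return weights @ bank

bank = make_orthonormal_bank(num_bases=4, dim=16)
weights = np.array([0.5, -1.0, 0.0, 2.0])
motion = combine(weights, bank)

# With orthonormal bases, projecting the motion back onto the bank
# recovers the weights that produced it.
recovered = bank @ motion
```

One consequence of orthonormal bases, visible in the last line, is that the weights describing a motion are uniquely determined by the motion itself, so a separate predictor (for example, one driven by audio) only needs to output weights over the stored bank.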

Paper Structure (50 sections, 18 equations, 15 figures, 6 tables)

Figures (15)

  • Figure 1: Illustrative animations produced by EDTalk. Given an identity source, EDTalk synthesizes talking face videos whose mouth shapes, head poses, and expressions are consistent with the mouth ground truth (GT), the pose source, and the expression source. These facial dynamics can also be inferred directly from the driving audio. Importantly, EDTalk demonstrates superior efficiency in disentanglement training compared to other methods.
  • Figure 2: Illustration of our proposed EDTalk. (a) EDTalk framework. Given an identity source $I^i$ and various driving images $I^*$ ($* \in \{m,p,e\}$) for controlling the corresponding facial components, EDTalk animates the identity image $I^i$ to mimic the mouth shape, head pose, and expression of $I^m$, $I^p$ and $I^e$ with the assistance of three Component-aware Latent Navigation modules: MLN, PLN and ELN. (b) Efficient Disentanglement. The disentanglement process consists of two parts: Mouth-Pose Decoupling and Expression Decoupling. For the former, we introduce a cross-reconstruction training strategy that separates mouth shape from head pose. For the latter, we achieve expression disentanglement using self-reconstruction complementary learning.
  • Figure 3: Overview of Audio-to-Motion. We design three modules to predict the weights $\hat{W}^m$, $\hat{W}^p$, $\hat{W}^e$ for mouth, pose, and expression, respectively.
  • Figure 4: Qualitative comparisons with state-of-the-art methods. See the full comparison in \ref{fig:full_compare5}.
  • Figure 5: Resources for training.
  • ...and 10 more figures
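
The component-wise control described in the Figure 2 caption, where mouth, pose, and expression are each taken from a different driving source, can be sketched as follows. This is a toy illustration rather than the paper's architecture: it assumes that the three banks are mutually orthogonal and that the driving latent is the identity latent plus the sum of the per-component motions. The names `banks`, `compose`, and `read_component` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k = 32, 4  # assumed latent size and number of bases per bank

# Assumption: all 3*k bases are mutually orthonormal, then split into
# mouth (m), pose (p), and expression (e) banks.
q, _ = np.linalg.qr(rng.standard_normal((dim, 3 * k)))
banks = {"m": q[:, :k].T, "p": q[:, k:2 * k].T, "e": q[:, 2 * k:].T}

def compose(identity_latent, weights):
    """Add the per-component motions to the identity latent."""
    motion = sum(weights[c] @ banks[c] for c in banks)
    return identity_latent + motion

def read_component(latent, identity_latent, c):
    """Project the motion part of a latent onto one component's bank."""
    return banks[c] @ (latent - identity_latent)

identity = rng.standard_normal(dim)
w_a = {c: rng.standard_normal(k) for c in banks}
# Replace only the pose weights, as if pose came from another source.
w_b = dict(w_a, p=rng.standard_normal(k))

lat_a = compose(identity, w_a)
lat_b = compose(identity, w_b)

mouth_a = read_component(lat_a, identity, "m")
mouth_b = read_component(lat_b, identity, "m")
pose_b = read_component(lat_b, identity, "p")
```

Under these assumptions, swapping the pose weights leaves the mouth and expression coordinates untouched (`mouth_a` equals `mouth_b`), which is the "without mutual interference" property the abstract asks of the decoupled spaces.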