AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding

Tao Liu, Feilong Chen, Shuai Fan, Chenpeng Du, Qi Chen, Xie Chen, Kai Yu

TL;DR

AniTalker tackles the challenge of generating lifelike talking faces from a single portrait by introducing a universal, identity-decoupled motion representation learned through self-supervision. It combines metric learning and mutual information disentanglement with a Hierarchical Aggregation Layer and a diffusion-based motion generator, enabling both video- and speech-driven animation with diverse and controllable outputs. The framework demonstrates strong quantitative and qualitative gains over state-of-the-art baselines in self- and cross-reenactment, and in audio-driven scenarios, while offering robust generalization to unseen identities and media. The work advances digital avatar realism and versatility, with practical implications for entertainment, education, and interactive media, and notes avenues for improving temporal coherence and rendering in future work.

Abstract

The paper introduces AniTalker, an innovative framework designed to generate lifelike talking faces from a single portrait. Unlike existing models that primarily focus on verbal cues such as lip synchronization and fail to capture the complex dynamics of facial expressions and nonverbal cues, AniTalker employs a universal motion representation. This innovative representation effectively captures a wide range of facial dynamics, including subtle expressions and head movements. AniTalker enhances motion depiction through two self-supervised learning strategies: the first involves reconstructing target video frames from source frames within the same identity to learn subtle motion representations, and the second develops an identity encoder using metric learning while actively minimizing mutual information between the identity and motion encoders. This approach ensures that the motion representation is dynamic and devoid of identity-specific details, significantly reducing the need for labeled data. Additionally, the integration of a diffusion model with a variance adapter allows for the generation of diverse and controllable facial animations. This method not only demonstrates AniTalker's capability to create detailed and realistic facial movements but also underscores its potential in crafting dynamic avatars for real-world applications. Synthetic results can be viewed at https://github.com/X-LANCE/AniTalker.
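The metric-learning idea behind the identity encoder can be illustrated with a small sketch. The abstract does not say which loss the authors use, so the triplet margin loss below is a generic stand-in chosen for illustration, not the paper's objective. The function names, the margin value, and the toy embeddings are all hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Generic triplet loss (an assumption, not the paper's stated loss).

    Pulls embeddings of the same identity together and pushes embeddings
    of different identities at least `margin` further apart.
    """
    gap = l2_distance(anchor, positive) - l2_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy identity embeddings (made-up numbers).
same_person_frame_1 = [0.0, 0.0]
same_person_frame_2 = [0.0, 1.0]   # distance 1 from frame 1
other_person_frame = [3.0, 4.0]    # distance 5 from frame 1

# Well-separated identities: the hinge is inactive, so the loss is zero.
well_separated = triplet_margin_loss(
    same_person_frame_1, same_person_frame_2, other_person_frame
)

# Swapping the roles simulates a confused encoder and yields a positive loss.
confused = triplet_margin_loss(
    same_person_frame_1, other_person_frame, same_person_frame_2
)
```

In a setup like the one the abstract describes, a loss of this kind would train only the identity encoder; a separate term would then penalize mutual information between identity and motion embeddings so that the motion code carries no identity cues.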

Paper Structure

This paper contains 35 sections, 6 equations, 10 figures, and 8 tables.

Figures (10)

  • Figure 1: The AniTalker framework comprises two main components: learning a universal motion representation, and then generating and manipulating this representation through a sequence model. Specifically, the first part aims to learn a robust motion representation by employing metric learning (ML), mutual information disentanglement (MID), and a Hierarchical Aggregation Layer (HAL). Subsequently, this motion representation can be used for further generation and manipulation.
  • Figure 2: Variance Adapter Block. Each block models a single attribute and can be iterated multiple times, where $N$ represents the number of attributes.
  • Figure 3: Cross-Reenactment Visualization: This task involves transferring actions from a target portrait to a source portrait to evaluate each algorithm's ability to separate motion and appearance. Starting from the third column, each column represents the output from a different algorithm. The results highlight our method's superior ability to preserve fidelity in both motion transfer and appearance retention.
  • Figure 4: Visual comparison of the speech-driven method in self- and cross-driven scenarios. Phonetic sounds are highlighted in red.
  • Figure 5: The weights of motion representation from different layers of the Image Encoder.
  • ...and 5 more figures
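
The captions for Figures 1 and 5 refer to a Hierarchical Aggregation Layer and to per-layer weights of the motion representation. One plausible reading, sketched below, is a softmax-weighted sum of features taken from several encoder layers. This is an assumption made for illustration: the captions do not give the layer's actual form, the sketch assumes every layer's feature has already been brought to the same dimension, and the names `softmax` and `aggregate_layers` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, layer_logits):
    """Weighted sum of per-layer feature vectors (assumed form of the HAL).

    layer_features: one feature vector per encoder layer, all the same length.
    layer_logits:   one learnable scalar per layer; softmax turns these into
                    weights that sum to one.
    """
    weights = softmax(layer_logits)
    dim = len(layer_features[0])
    return [
        sum(w * feats[i] for w, feats in zip(weights, layer_features))
        for i in range(dim)
    ]


# Two toy layers with equal logits: the result is the plain average.
features = [[1.0, 2.0], [3.0, 4.0]]
averaged = aggregate_layers(features, [0.0, 0.0])

# A much larger logit on the second layer makes it dominate the mix.
skewed = aggregate_layers(features, [0.0, 10.0])
```

Under this reading, inspecting the learned weights after training would show which encoder depths contribute most to the motion representation, which matches the kind of per-layer weight plot the Figure 5 caption describes.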