Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion

Xingpei Ma, Jiaran Cai, Yuansheng Guan, Shenneng Huang, Qiang Zhang, Shunsi Zhang

TL;DR

Playmate tackles uncontrolled factors in audio-driven portrait animation by decoupling facial attributes within a 3D-implicit space and employing a diffusion transformer guided by audio. A motion-decoupled module enhances disentanglement of expression, lip movement, and head pose, while an emotion-control module injects explicit emotion cues into the latent space to enable fine-grained control. Extensive experiments demonstrate state-of-the-art video quality and strong lip synchronization, with improved flexibility in manipulating emotion and head pose across identities. The approach broadens practical applications in avatar generation and media production, while acknowledging limitations such as a focus on facial regions and potential artifacts from 3D-implicit representations, with future work aiming at full-body extension and rendering improvements.

Abstract

Recent diffusion-based talking face generation models have demonstrated impressive potential in synthesizing videos that accurately match a speech audio clip with a given reference identity. However, existing approaches still encounter significant challenges due to uncontrollable factors, such as inaccurate lip-sync, inappropriate head posture and the lack of fine-grained control over facial expressions. In order to introduce more face-guided conditions beyond speech audio clips, a novel two-stage training framework Playmate is proposed to generate more lifelike facial expressions and talking faces. In the first stage, we introduce a decoupled implicit 3D representation along with a meticulously designed motion-decoupled module to facilitate more accurate attribute disentanglement and generate expressive talking videos directly from audio cues. Then, in the second stage, we introduce an emotion-control module to encode emotion control information into the latent space, enabling fine-grained control over emotions and thereby achieving the ability to generate talking videos with desired emotion. Extensive experiments demonstrate that Playmate not only outperforms existing state-of-the-art methods in terms of video quality, but also exhibits strong competitiveness in lip synchronization while offering improved flexibility in controlling emotion and head pose. The code will be available at https://github.com/Playmate111/Playmate.

Paper Structure

This paper contains 19 sections, 10 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 2: Framework of our approach. Playmate is a two-stage training framework that leverages a 3D-Implicit Space Guided Diffusion Model to generate lifelike talking faces. In the first stage, Playmate utilizes a motion-decoupled module to enhance attribute disentanglement accuracy and trains a diffusion transformer to generate motion sequences directly from audio cues. In the second stage, we use an emotion-control module to encode emotion control information into the latent space, enabling fine-grained control over emotions, thereby improving flexibility in controlling emotion and head pose.
  • Figure 3: The structure of the Emotion-control Module.
  • Figure 4: Qualitative comparisons with state-of-the-art methods. Previous methods are prone to artifacts in tooth rendering (e.g., (a) row 6, column 3; (b) row 3, column 1) and in lip synchronization (e.g., (a) row 4, column 7; (b) row 2, column 7). By contrast, our approach has a superior decoupling capability, which allows it to create more lifelike talking head videos. For more comparison details, please see the Appendix.
  • Figure 5: Visualization results on images of different styles. Playmate can drive a wide range of portraits, including real humans, animations, artistic portraits, and even animals.
  • Figure 6: Visualization results of emotion control. Each row shows the generations for a different identity under different emotional conditions using the same audio clip, demonstrating Playmate's flexibility in controlling emotion.
  • ...and 4 more figures