Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion
Xingpei Ma, Jiaran Cai, Yuansheng Guan, Shenneng Huang, Qiang Zhang, Shunsi Zhang
TL;DR
Playmate addresses hard-to-control factors in audio-driven portrait animation by decoupling facial attributes within a 3D-implicit space and generating motion with an audio-guided diffusion transformer. A motion-decoupled module improves the disentanglement of expression, lip movement, and head pose, while an emotion-control module injects explicit emotion cues into the latent space to enable fine-grained control. Extensive experiments demonstrate state-of-the-art video quality and competitive lip synchronization, with improved flexibility in manipulating emotion and head pose across identities. The approach broadens practical applications in avatar generation and media production. Its acknowledged limitations include a focus on facial regions and potential artifacts from 3D-implicit representations; future work aims at full-body extension and rendering improvements.
Abstract
Recent diffusion-based talking face generation models have demonstrated impressive potential in synthesizing videos that accurately match a speech audio clip with a given reference identity. However, existing approaches still struggle with uncontrollable factors such as inaccurate lip-sync, inappropriate head posture, and a lack of fine-grained control over facial expressions. To introduce more face-guided conditions beyond speech audio clips, we propose Playmate, a novel two-stage training framework that generates more lifelike facial expressions and talking faces. In the first stage, we introduce a decoupled implicit 3D representation along with a meticulously designed motion-decoupled module to facilitate more accurate attribute disentanglement and generate expressive talking videos directly from audio cues. In the second stage, we introduce an emotion-control module that encodes emotion control information into the latent space, enabling fine-grained control over emotions and thus the generation of talking videos with a desired emotion. Extensive experiments demonstrate that Playmate not only outperforms existing state-of-the-art methods in terms of video quality, but is also strongly competitive in lip synchronization while offering improved flexibility in controlling emotion and head pose. The code will be available at https://github.com/Playmate111/Playmate.
