One-Shot Pose-Driving Face Animation Platform
He Feng, Donglin Di, Yongjia Ma, Wei Chen, Tonghua Su
TL;DR
This work addresses one-shot face animation, in which a single reference image is animated by driving signals, and targets expressiveness, temporal continuity, and background stability without identity-specific fine-tuning. It refines an Image2Video framework by adding a Face Locator and a Motion Frame mechanism, and trains on CelebV-HQ and HDTF with CLIP guidance to improve realism. The method builds on AnimateAnyone, using a Reference Net, a Denoising UNet, and a Pose Guider to translate pose sequences from DWPose or Audio2Pose into latent facial motion. A Gradio-based demo platform provides Input2Pose and Image2Video pipelines for quick, user-friendly generation, with potential applications in education and personal assistance and room for future real-time optimization.
Abstract
Face animation aims to generate dynamic, expressive talking-head videos from a single reference face, using driving conditions derived from video or audio input. Current approaches often require fine-tuning for each identity and frequently fail to produce expressive videos because their Wav2Pose modules are of limited effectiveness. To enable one-shot generation of temporally continuous talking-head videos, we refine an existing Image2Video model by integrating a Face Locator and a Motion Frame mechanism. We then optimize the model on large-scale human face video datasets, which substantially improves its ability to produce high-quality, expressive talking-head videos. We also build a demo platform with the Gradio framework that streamlines this process and lets users quickly create customized talking-head videos.
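The abstract does not spell out how the Motion Frame mechanism works. As a rough illustration only, the sketch below assumes one common reading: a long pose sequence is generated clip by clip, and the last few frames of each clip are passed to the next clip as context so that motion stays continuous across clip boundaries. All names here (`generate_clip`, `animate`, `clip_len`, `n_motion`) are hypothetical, and `generate_clip` is a string-based placeholder for the diffusion model, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def generate_clip(reference: str, poses: Sequence[str],
                  motion_frames: Sequence[str]) -> List[str]:
    # Hypothetical placeholder for the Image2Video model. A real model would
    # denoise latents conditioned on the reference image, the pose sequence,
    # and the motion frames; here each "frame" is just a descriptive string
    # and motion_frames is accepted but unused.
    return [f"{reference}@{pose}" for pose in poses]


def animate(reference: str, poses: Sequence[str], clip_len: int = 4,
            n_motion: int = 2) -> Tuple[List[str], List[List[str]]]:
    video: List[str] = []
    contexts: List[List[str]] = []  # motion frames seen by each clip
    # Assumption: before any frame exists, the reference image itself
    # stands in for the missing motion frames.
    motion: List[str] = [reference] * n_motion
    for start in range(0, len(poses), clip_len):
        contexts.append(list(motion))
        clip = generate_clip(reference, poses[start:start + clip_len], motion)
        video.extend(clip)
        # Carry the most recent n_motion frames into the next clip.
        motion = (motion + clip)[-n_motion:]
    return video, contexts


poses = [f"p{i}" for i in range(10)]
video, contexts = animate("face", poses)
```

In this toy run, ten poses are split into clips of four, four, and two frames, and `contexts` records which earlier frames each clip was conditioned on.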
