One-Shot Pose-Driving Face Animation Platform

He Feng, Donglin Di, Yongjia Ma, Wei Chen, Tonghua Su

TL;DR

This work addresses one-shot face animation, in which a single reference image is animated under driving signals, and targets expressiveness, temporal continuity, and background stability without identity-specific fine-tuning. It refines an Image2Video framework by adding a Face Locator and a Motion Frame mechanism, and trains on CelebV-HQ and HDTF with CLIP guidance to improve realism. The method builds on AnimateAnyone, using a Reference Net, a Denoising UNet, and a Pose Guider to translate pose sequences from DWPose or Audio2Pose into latent facial motion. A Gradio-based demo platform exposes Input2Pose and Image2Video pipelines for quick, user-friendly generation, with potential applications in education and personal assistance and room for future real-time optimization.

Abstract

The objective of face animation is to generate dynamic and expressive talking head videos from a single reference face, using driving conditions derived from either video or audio inputs. Current approaches often require fine-tuning for specific identities and frequently fail to produce expressive videos because their Wav2Pose modules are of limited effectiveness. To enable one-shot generation of more temporally continuous talking head videos, we refine an existing Image2Video model by integrating a Face Locator and a Motion Frame mechanism. We then train the model on large-scale human face video datasets, which improves its ability to produce high-quality and expressive talking head videos. Additionally, we develop a demo platform using the Gradio framework, which streamlines the process and lets users quickly create customized talking head videos.
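
The abstract does not spell out how the Motion Frame mechanism works. As a rough illustration only, the sketch below assumes the common reading in which a long pose sequence is generated clip by clip and each clip is conditioned on the last few frames of the previous one, which is one way to obtain continuity across clip boundaries. The names `plan_clips`, `animate`, `generate_clip`, `clip_len`, and `num_motion_frames` are hypothetical and do not come from the paper or its code.

```python
from typing import Callable, List, Sequence, Tuple


def plan_clips(
    num_pose_frames: int, clip_len: int, num_motion_frames: int
) -> List[Tuple[List[int], range]]:
    """Split a pose sequence into clips (hypothetical helper).

    Each entry pairs the indices of already-generated frames used as
    motion frames with the range of pose indices to generate next.
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    if not 0 <= num_motion_frames <= clip_len:
        raise ValueError("num_motion_frames must be in [0, clip_len]")
    plan = []
    for start in range(0, num_pose_frames, clip_len):
        end = min(start + clip_len, num_pose_frames)
        # The first clip has no history, so its motion-frame list is empty.
        motion = list(range(max(0, start - num_motion_frames), start))
        plan.append((motion, range(start, end)))
    return plan


def animate(
    reference: object,
    poses: Sequence[object],
    generate_clip: Callable[[object, list, list], list],
    clip_len: int = 4,
    num_motion_frames: int = 2,
) -> list:
    """Generate frames clip by clip, feeding trailing frames forward.

    `generate_clip` stands in for the Image2Video model (assumption):
    it receives the reference image, the poses for one clip, and the
    motion frames, and returns one output frame per pose.
    """
    frames: list = []
    for motion_idx, pose_idx in plan_clips(len(poses), clip_len, num_motion_frames):
        motion = [frames[i] for i in motion_idx]
        clip_poses = [poses[i] for i in pose_idx]
        frames.extend(generate_clip(reference, clip_poses, motion))
    return frames
```

In this sketch the pose sequence would come from the Input2Pose stage (DWPose for a driving video, Audio2Pose for audio), and `generate_clip` would wrap the Image2Video stage. Keeping the model behind a callable makes the clip scheduling testable without any model weights.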

Paper Structure (4 sections, 1 figure)

This paper contains 4 sections, 1 figure.

Figures (1)

  • Figure 1: Selected running results of our platform.