HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation

Zunnan Xu, Zhentao Yu, Zixiang Zhou, Jun Zhou, Xiaoyu Jin, Fa-Ting Hong, Xiaozhong Ji, Junwei Zhu, Chengfei Cai, Shiyu Tang, Qin Lin, Xiu Li, Qinglin Lu

TL;DR

This work tackles portrait animation with an implicit condition control framework, built on Stable Video Diffusion, that decouples motion from appearance. It combines a fine-grained appearance extractor with an intensity-aware motion extractor, supported by a motion memory bank and an ID-aware Multi-scale Adapter (IMAdapter), and injects motion and identity signals into the denoising network through cross-attention, avoiding fine-tuning of the diffusion model. The authors report state-of-the-art temporal consistency and controllability in both self- and cross-reenactment, with generalization across styles and facial geometries. The method targets applications in VR, gaming, and virtual avatars; the paper also acknowledges ethical considerations, proposes safeguards, and outlines future extensions for broader articulation and efficiency.

Abstract

We introduce HunyuanPortrait, a diffusion-based condition control method that employs implicit representations for highly controllable and lifelike portrait animation. Given a single portrait image as an appearance reference and video clips as driving templates, HunyuanPortrait animates the character in the reference image according to the facial expressions and head poses of the driving videos. In our framework, we use pre-trained encoders to decouple portrait motion information from identity in videos. To do so, an implicit representation is adopted to encode motion information and is employed as the control signal in the animation phase. Using Stable Video Diffusion as the main building block, we carefully design adapter layers that inject control signals into the denoising U-Net through attention mechanisms. These layers bring spatial richness of detail and temporal consistency. HunyuanPortrait also exhibits strong generalization performance and can effectively disentangle appearance and motion under different image styles. Our framework outperforms existing methods, demonstrating superior temporal consistency and controllability. Our project is available at https://kkakkkka.github.io/HunyuanPortrait.
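The attention-based injection mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single-head cross-attention with a residual connection, where denoiser tokens act as queries and control tokens (motion or appearance features) act as keys and values. The function name `cross_attention_adapter`, the projection matrices `w_q`, `w_k`, `w_v`, and the `scale` parameter are all hypothetical names chosen for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as nested lists (rows of a, columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix given as nested lists."""
    return [list(col) for col in zip(*m)]


def cross_attention_adapter(latents, control, w_q, w_k, w_v, scale=1.0):
    """Hypothetical adapter layer (an assumption, not the paper's IMAdapter).

    latents: N x d tokens from the denoising network (queries).
    control: M x c tokens from a motion or appearance encoder (keys, values).
    w_q: d x h, w_k: c x h, w_v: c x d projection matrices.
    Returns N x d tokens: latents plus a scaled attention update.
    """
    q = matmul(latents, w_q)              # N x h
    k = matmul(control, w_k)              # M x h
    v = matmul(control, w_v)              # M x d
    h = len(q[0])
    scores = matmul(q, transpose(k))      # N x M similarity scores
    attn = [softmax([s / math.sqrt(h) for s in row]) for row in scores]
    update = matmul(attn, v)              # N x d, control-weighted values
    # Residual injection: scale=0.0 leaves the denoiser features unchanged.
    return [
        [x + scale * u for x, u in zip(l_row, u_row)]
        for l_row, u_row in zip(latents, update)
    ]
```

In this sketch the residual form means that setting `scale` to zero leaves the denoiser features unchanged, which is one common way an adapter can add conditioning around a pre-trained backbone. How the paper's adapter layers are actually structured is not specified in the abstract.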

Paper Structure

This paper contains 24 sections, 5 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Our framework employs implicit condition control to generate portrait animations, demonstrating robust generalization performance with high-fidelity facial dynamics and vivid head poses during cross-reenactment. The animated portraits remain unaffected by variations in facial shape and the spatial position of the driving videos, demonstrating strong identity consistency.
  • Figure 2: Our framework utilizes implicit representation to encode motion information as control signals. By harnessing the capabilities of stable video diffusion as the primary building block, we have meticulously designed a fine-grained appearance extractor to maintain the identity of the portrait, along with an intensity-aware motion extractor to capture intricate facial dynamics.
  • Figure 3: The illustration of our ID-aware Multi-scale Adapter (IMAdapter). Here, ⓒ represents the operation of concatenating features along the channel dimension.
  • Figure 4: Qualitative comparisons of self-reenactment and cross-reenactment with state-of-the-art methods.
  • Figure 5: Qualitative ablation studies of different components.
  • ...and 5 more figures