X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention

You Xie, Hongyi Xu, Guoxian Song, Chao Wang, Yichun Shi, Linjie Luo

TL;DR

X-Portrait addresses expressive portrait animation by augmenting a frozen latent diffusion backbone with three trainable modules for appearance, motion, and temporal coherence. It introduces a cross-identity training scheme in which a pre-trained reenactment network $\mathcal{F}$ converts each driving RGB frame into a cross-identity control image $I_C$; a local patch $I^l_C$ sharpens motion attention on fine facial detail, and random scaling of the control images mitigates appearance leakage. Inference runs directly on the reference portrait and the driving frames, without requiring $\mathcal{F}$. The approach yields strong identity preservation, rich motion expressiveness, and robust generalization to out-of-domain and stylized portraits, outperforming state-of-the-art baselines in both self- and cross-reenactment benchmarks. This zero-shot framework demonstrates practical potential for high-fidelity, controllable portrait animation without fine-tuning, enabling broad applications in video synthesis and digital avatars.

Abstract

We propose X-Portrait, an innovative conditional diffusion model tailored for generating expressive and temporally coherent portrait animation. Specifically, given a single portrait as appearance reference, we aim to animate it with motion derived from a driving video, capturing both highly dynamic and subtle facial expressions along with wide-range head movements. At its core, we leverage the generative prior of a pre-trained diffusion model as the rendering backbone, while achieving fine-grained head pose and expression control with novel controlling signals within the framework of ControlNet. In contrast to conventional coarse explicit controls such as facial landmarks, our motion control module is learned to interpret the dynamics directly from the original driving RGB inputs. The motion accuracy is further enhanced with a patch-based local control module that effectively enhances the motion attention to small-scale nuances like eyeball positions. Notably, to mitigate the identity leakage from the driving signals, we train our motion control modules with scaling-augmented cross-identity images, ensuring maximized disentanglement from the appearance reference modules. Experimental results demonstrate the universal effectiveness of X-Portrait across a diverse range of facial portraits and expressive driving sequences, and showcase its proficiency in generating captivating portrait animations with consistently maintained identity characteristics.
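The scaling augmentation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `random_heterogeneous_scale`, the scale range `[0.8, 1.2]`, the nearest-neighbor resampling, and the reading of "heterogeneous" as independent per-axis factors are all assumptions made for illustration.

```python
import numpy as np

def random_heterogeneous_scale(image, rng, low=0.8, high=1.2):
    """Scale height and width by independent random factors, then
    center-crop or zero-pad back to the original size.

    Hypothetical helper: the range [low, high] is an assumed value,
    not one taken from the paper.
    """
    h, w = image.shape[:2]
    sy, sx = (float(s) for s in rng.uniform(low, high, size=2))
    new_h = max(1, int(round(h * sy)))
    new_w = max(1, int(round(w * sx)))

    # Nearest-neighbor resampling: map each output index to a source index.
    rows = np.minimum((np.arange(new_h) / sy).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / sx).astype(int), w - 1)
    scaled = image[rows][:, cols]

    # Paste the centered overlap into a canvas of the original size.
    out = np.zeros_like(image)
    ch, cw = min(h, new_h), min(w, new_w)
    src_y, src_x = (new_h - ch) // 2, (new_w - cw) // 2
    dst_y, dst_x = (h - ch) // 2, (w - cw) // 2
    out[dst_y:dst_y + ch, dst_x:dst_x + cw] = scaled[src_y:src_y + ch, src_x:src_x + cw]
    return out

rng = np.random.default_rng(0)
control = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
augmented = random_heterogeneous_scale(control, rng)
```

Because the two axes are stretched independently, the face proportions in the control image no longer match those of the driving identity, which is the intuition behind using scaling to reduce appearance leakage.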

Paper Structure (21 sections, 1 equation, 7 figures, 2 tables)


Figures (7)

  • Figure 1: Given a single reference portrait (left column), X-Portrait is capable of synthesizing compelling and expressive animations (right columns), covering large head pose changes and highly dynamic and detailed facial expression features from the input driving videos. Notably, X-Portrait can faithfully preserve the identity information from the reference portrait while transferring the expression subtleties across a wide range of portrait styles. Please see the supplementary video for more dynamic results and https://github.com/bytedance/X-Portrait/tree/main for released code and models. ⓒGeorge E. Koronaios, Dmitriy Ganin and Zacleonardi.
  • Figure 2: Overview of X-Portrait. For the task of portrait animation, X-Portrait leverages a frozen pre-trained LDM as a rendering backbone, and incorporates three auxiliary trainable modules for disentangled control of appearance $\mathcal{R}$, motion $\mathcal{C}$ and temporal smoothness $\mathcal{M}$. Specifically, $\mathcal{R}$ extracts the source appearance and background context from a reference image $I_S$, and $\mathcal{C}$ derives the motion of head pose and facial expression from a driving frame $I_D$. During training, we leverage a pre-trained network $\mathcal{F}$ to generate cross-identity control images $I_C$ as conditional input to our control modules $\mathcal{C}$. To better capture subtle expressions, we enhance the attention to the local detailed facial movements with an additional masked control image $I_C^l$. Both $I_C$ and $I_C^l$ are subject to random heterogeneous scaling to mitigate appearance leakage from the driving frames. For inference, we animate a source portrait directly with the video frames without any pre-processing, enabling expressive and robust animation with strictly maintained identity resemblance. ⓒTima Miroshnichenko.
  • Figure 3: Ablation. (a) Training with a scaling-augmented ground-truth driving image (with the same identity as the reference) as the motion condition results in severe identity appearance leakage from the driving image. (b) Our local motion control module effectively enhances the capture of subtle and detailed local facial movements. (c) Random scaling of the conditional control images improves identity preservation (note the head shape and eye sizes). ⓒArchonom, Artbyhoussam, Cottonbro Studio and Ketut Subiyanto.
  • Figure 4: Qualitative comparisons. Among all the methods, X-Portrait achieves the most accurate and robust transfer of both subtle and extreme facial expressions (e.g., pouting and single-eye blinks) and wide-range head translations and rotations, with precise identity resemblance (e.g., face shapes, eyes/mouth sizes) to the reference even with artistic styles (e.g., pencil drawing and anime). ⓒArtbyhoussam, Midjourney.com and Katiin Bolovtsova.
  • Figure 5: X-Portrait exhibits limited transfer of expressions when $\mathcal{F}$ completely fails to produce any correlated motion cues (e.g., turning lips inwards in the first row and puffing cheeks in the second row).
  • ...and 2 more figures
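
The masked local control image $I_C^l$ described in the Figure 2 caption can be sketched as follows. This is only an illustration under stated assumptions: the helper `local_control_image`, the rectangular `box` region, and zero-filling outside the region are hypothetical choices, and the captions above do not say how the local region is selected.

```python
import numpy as np

def local_control_image(control, box):
    """Keep only the pixels inside box = (top, left, bottom, right)
    and zero out everything else.

    Hypothetical helper: a rectangular mask is an assumed stand-in
    for whatever local region the method actually uses.
    """
    top, left, bottom, right = box
    masked = np.zeros_like(control)
    masked[top:bottom, left:right] = control[top:bottom, left:right]
    return masked

# Example: keep an assumed eye/mouth region of a 64x64 control image.
control = np.ones((64, 64, 3), dtype=np.float32)
local = local_control_image(control, box=(16, 12, 48, 52))
```

Feeding such a masked image alongside the full control image gives the motion module a second input that contains only the local facial region, so fine movements are not drowned out by global head motion.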