X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention
You Xie, Hongyi Xu, Guoxian Song, Chao Wang, Yichun Shi, Linjie Luo
TL;DR
X-Portrait addresses expressive portrait animation by leveraging a frozen latent diffusion backbone augmented with three trainable modules for appearance, motion, and temporal coherence. It introduces a cross-identity training scheme that derives motion directly from driving RGB inputs: a pre-trained reenactment network $\mathcal{F}$ produces a cross-identity control image $I_C$, a local patch $I^l_C$ sharpens motion attention, and random scaling mitigates appearance leakage. Inference runs directly on the reference portrait without requiring $\mathcal{F}$. The approach yields strong identity preservation, rich motion expressiveness, and robust generalization to out-of-domain and stylized portraits, outperforming state-of-the-art baselines in both self- and cross-reenactment benchmarks. This zero-shot framework demonstrates practical potential for high-fidelity, controllable portrait animation without fine-tuning, enabling broad applications in video synthesis and digital avatars.
Abstract
We propose X-Portrait, an innovative conditional diffusion model tailored for generating expressive and temporally coherent portrait animation. Specifically, given a single portrait as appearance reference, we aim to animate it with motion derived from a driving video, capturing both highly dynamic and subtle facial expressions along with wide-range head movements. At its core, we leverage the generative prior of a pre-trained diffusion model as the rendering backbone, while achieving fine-grained head pose and expression control with novel controlling signals within the framework of ControlNet. In contrast to conventional coarse explicit controls such as facial landmarks, our motion control module is learned to interpret the dynamics directly from the original driving RGB inputs. The motion accuracy is further improved with a patch-based local control module that effectively enhances the motion attention to small-scale nuances like eyeball positions. Notably, to mitigate the identity leakage from the driving signals, we train our motion control modules with scaling-augmented cross-identity images, ensuring maximized disentanglement from the appearance reference modules. Experimental results demonstrate the universal effectiveness of X-Portrait across a diverse range of facial portraits and expressive driving sequences, and showcase its proficiency in generating captivating portrait animations with consistently maintained identity characteristics.
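The scaling-augmented control input and the patch-based local control can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the scale range of 0.8 to 1.2, the nearest-neighbor resampling, the patch size, and the choice to crop the local patch from the already scaled image are all assumptions made for illustration. The input frame stands in for a cross-identity image that a reenactment network would have produced during training.

```python
import numpy as np


def random_scale(image, rng, scale_range=(0.8, 1.2)):
    """Rescale image content about its center, keeping the canvas size.

    Hypothetical helper: the scale range and nearest-neighbor sampling
    are illustrative assumptions, not values taken from the paper.
    """
    h, w = image.shape[:2]
    s = rng.uniform(*scale_range)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # Inverse mapping: for each output pixel, find the source pixel.
    src_y = np.round((np.arange(h) - cy) / s + cy).astype(int)
    src_x = np.round((np.arange(w) - cx) / s + cx).astype(int)
    valid = ((src_y >= 0) & (src_y < h))[:, None] & \
            ((src_x >= 0) & (src_x < w))[None, :]
    sampled = image[np.clip(src_y, 0, h - 1)][:, np.clip(src_x, 0, w - 1)]
    out = np.zeros_like(image)  # pixels outside the source stay black
    out[valid] = sampled[valid]
    return out, s


def crop_local_patch(image, center, size):
    """Crop a size x size patch around center, clamped to the image."""
    h, w = image.shape[:2]
    top = int(np.clip(center[0] - size // 2, 0, h - size))
    left = int(np.clip(center[1] - size // 2, 0, w - size))
    return image[top:top + size, left:left + size]


def make_control_inputs(cross_id_frame, patch_center, patch_size=64, seed=0):
    """Build a global control image and a local patch (assumed pipeline)."""
    rng = np.random.default_rng(seed)
    i_c, scale = random_scale(cross_id_frame, rng)
    i_c_local = crop_local_patch(i_c, patch_center, patch_size)
    return i_c, i_c_local, scale
```

For example, calling `make_control_inputs(frame, patch_center=(100, 128))` on a 256 x 256 RGB frame returns a 256 x 256 control image, a 64 x 64 local patch, and the sampled scale factor. Rescaling changes the apparent face size and proportions of the control image while leaving pose and expression intact, which is the intuition behind using it to discourage the motion module from copying appearance.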
