X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention

Xiaochen Zhao, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xiu Li, Linjie Luo, Jinli Suo, Yebin Liu

TL;DR

X-NeMo addresses zero-shot portrait animation by disentangling motion from identity through an end-to-end learned 1D latent motion descriptor, controlled via cross-attention in a diffusion backbone. By avoiding spatially aligned conditioning and incorporating dual-head latent supervision with targeted augmentations, it mitigates identity leakage while enhancing expressiveness for both subtle and extreme expressions. The approach achieves state-of-the-art performance in both self- and cross-reenactment across diverse identities, and it enables motion interpolation and video outpainting. Extensive ablations validate the design choices and demonstrate robust generalization, with code and models released for research.

Abstract

We propose X-NeMo, a novel zero-shot diffusion-based portrait animation pipeline that animates a static portrait using facial movements from a driving video of a different individual. Our work first identifies the root causes of the key issues in prior approaches, such as identity leakage and difficulty in capturing subtle and extreme expressions. To address these challenges, we introduce a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from a driving image, effectively controlling motion through cross-attention during image generation. Our implicit motion descriptor captures expressive facial motion in fine detail, learned end-to-end from a diverse video dataset without reliance on pretrained motion detectors. We further enhance expressiveness and disentangle motion latents from identity cues by supervising their learning with a dual GAN decoder, alongside spatial and color augmentations. By embedding the driving motion into a 1D latent vector and controlling motion via cross-attention rather than additive spatial guidance, our design eliminates the transmission of spatially aligned structural cues from the driving condition to the diffusion backbone, substantially mitigating identity leakage. Extensive experiments demonstrate that X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Our code and models are available for research.
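
The contrast between cross-attention control and additive spatial guidance can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function name `motion_cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, all dimensions, and the choice to split the 1D latent into a handful of tokens are assumptions made for illustration only. What the sketch shows is that the motion condition enters as keys and values with no spatial layout, so no pixel-aligned structure from the driving image is added to the backbone features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def motion_cross_attention(feat, motion_latent, w_q, w_k, w_v, num_tokens):
    """Condition spatial features on a 1D motion latent via cross-attention.

    feat:          (H*W, C) flattened backbone features (queries).
    motion_latent: (D,) 1D motion descriptor with no spatial dimensions.
    num_tokens:    hypothetical number of tokens the latent is split into.
    """
    # Assumption: the 1D latent is reshaped into a few tokens so that
    # attention has more than one key to choose from.
    tokens = motion_latent.reshape(num_tokens, -1)        # (T, D/T)
    q = feat @ w_q                                        # (H*W, A)
    k = tokens @ w_k                                      # (T, A)
    v = tokens @ w_v                                      # (T, C)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), -1)    # (H*W, T)
    # Residual update: each spatial location pulls motion information
    # from the latent tokens; nothing is added pixel-by-pixel.
    return feat + attn @ v, attn


rng = np.random.default_rng(0)
H, W, C, D, T, A = 8, 8, 32, 64, 4, 16   # illustrative sizes only
feat = rng.standard_normal((H * W, C))
motion_latent = rng.standard_normal(D)
w_q = rng.standard_normal((C, A)) * 0.1
w_k = rng.standard_normal((D // T, A)) * 0.1
w_v = rng.standard_normal((D // T, C)) * 0.1

out, attn = motion_cross_attention(feat, motion_latent, w_q, w_k, w_v, T)
print(out.shape, attn.shape)
```

An additive spatial guidance module would instead compute something like `feat + guidance_map`, where `guidance_map` has the same `(H*W, C)` layout as the features and can therefore carry the driving subject's facial structure into the backbone.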

Paper Structure

This paper contains 36 sections, 6 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: We present X-NeMo, a diffusion-based portrait animation framework that integrates expressive 1D latent motion descriptors with identity-disentangled motion control through cross-attention mechanisms (left). Our method enables meticulous transfer of expressive head poses and detailed facial expressions while maintaining identity consistency, even across subjects with distinct appearances, styles and facial structures (right).
  • Figure 2: Overview of X-NeMo. We leverage a pretrained diffusion model $\mathcal{D}$ as the rendering backbone and incorporate a reference network module $\mathcal{R}$ for appearance conditioning, along with temporal modules for cross-frame consistency. For motion control, we train a latent motion embedding $f_{mot}$ encoded from the driving image $I_D$ after applying spatial and color augmentations. Alongside the relative translation and scaling $f_{rts}$ of the face bounding box between the reference image $I_R$ and the driving image $I_D$, we integrate the latent motion conditions into the diffusion backbone using newly inserted cross-attention layers. Besides the original diffusion loss $L_{ldm}$, we supervise the learning of our latent motion embedding with a jointly trained GAN decoder head using image-level losses $L_{gan}$. During inference, we derive the latent motion codes directly from each driving frame, allowing us to synthesize expressive and precise animations while strictly maintaining identity resemblance to the reference image.
  • Figure 3: Qualitative ablation study on factors affecting identity consistency. (a) Replacing our motion cross-attentions with a control module using spatially additive guidance leads to severe leakage of the driving identity's facial structure. (b) Training without our color and spatial augmentations results in noticeable appearance leakage and identity drift.
  • Figure 4: Qualitative ablation study on factors influencing motion expressiveness. (a) Without the dual GAN head, training solely with the diffusion loss hinders the motion encoder's ability to learn detailed and local motion patterns. (b) Our reference feature masking (RFM) strategy facilitates the transfer of fine-level facial expressions, such as the wrinkles at the nasal region.
  • Figure 5: Qualitative comparisons. Among all the methods, X-NeMo achieves the most accurate transfer of intricate expressions and emotional subtleties while demonstrating the highest identity resemblance, regardless of the characteristic differences between the reference and driving identities.
  • ...and 7 more figures
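
The Figure 2 caption mentions a relative translation and scaling signal $f_{rts}$ computed from the face bounding boxes of the reference and driving images, but it does not spell out the parameterization. The sketch below shows one plausible way to compute such a signal. The function name `relative_translation_scale`, the `(x_min, y_min, x_max, y_max)` box format, the normalization by the reference box size, and the area-based scale ratio are all assumptions, not the paper's definition.

```python
import math


def relative_translation_scale(ref_box, drv_box):
    """Hypothetical f_rts: offset and scale of the driving face box
    relative to the reference face box.

    Boxes are (x_min, y_min, x_max, y_max) in pixels (assumed format).
    Returns (tx, ty, s), where tx and ty are center offsets normalized
    by the reference box width and height, and s is the square root of
    the area ratio (driving over reference).
    """
    rx0, ry0, rx1, ry1 = ref_box
    dx0, dy0, dx1, dy1 = drv_box
    ref_w, ref_h = rx1 - rx0, ry1 - ry0
    drv_w, drv_h = dx1 - dx0, dy1 - dy0
    if min(ref_w, ref_h, drv_w, drv_h) <= 0:
        raise ValueError("bounding boxes must have positive width and height")

    # Box centers.
    ref_cx, ref_cy = (rx0 + rx1) / 2, (ry0 + ry1) / 2
    drv_cx, drv_cy = (dx0 + dx1) / 2, (dy0 + dy1) / 2

    # Translation is expressed in units of the reference box size, so the
    # signal does not depend on image resolution.
    tx = (drv_cx - ref_cx) / ref_w
    ty = (drv_cy - ref_cy) / ref_h
    s = math.sqrt((drv_w * drv_h) / (ref_w * ref_h))
    return tx, ty, s


ref_box = (100, 100, 300, 300)   # illustrative values
drv_box = (150, 120, 250, 220)
f_rts = relative_translation_scale(ref_box, drv_box)
print(f_rts)
```

Keeping head placement and size in an explicit low-dimensional signal like this is consistent with the caption's description of $f_{rts}$ as a condition supplied alongside the latent motion embedding $f_{mot}$; how the two are combined inside the cross-attention layers is not specified in the text above.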