X-Dyna: Expressive Dynamic Human Image Animation

Di Chang, Hongyi Xu, You Xie, Yipeng Gao, Zhengfei Kuang, Shengqu Cai, Chenxu Zhang, Guoxian Song, Chao Wang, Yichun Shi, Zeyuan Chen, Shijie Zhou, Linjie Luo, Gordon Wetzstein, Mohammad Soleymani

TL;DR

X-Dyna presents a zero-shot diffusion-based pipeline for animating a single human image using a driving video, addressing the loss of dynamic details by introducing a lightweight Dynamics-Adapter that injects reference appearance into spatial attentions without harming motion synthesis. A local face control module enables identity-disentangled facial expressions, while Harmonic Data Fusion Training blends human and natural-scene videos to learn both subject dynamics and background motion. The approach achieves state-of-the-art results in pose transfer, expression accuracy, and dynamic realism, demonstrated through extensive quantitative metrics and a user study. Together, these innovations enable more lifelike, context-aware human video animations with robust background and environmental dynamics.

Abstract

We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for animating a single human image using facial expressions and body movements derived from a driving video, that generates realistic, context-aware dynamics for both the subject and the surrounding environment. Building on prior approaches centered on human pose control, X-Dyna addresses key shortcomings causing the loss of dynamic details, enhancing the lifelike qualities of human video animations. At the core of our approach is the Dynamics-Adapter, a lightweight module that effectively integrates reference appearance context into the spatial attentions of the diffusion backbone while preserving the capacity of motion modules in synthesizing fluid and intricate dynamic details. Beyond body pose control, we connect a local control module with our model to capture identity-disentangled facial expressions, facilitating accurate expression transfer for enhanced realism in animated scenes. Together, these components form a unified framework capable of learning physical human motion and natural scene dynamics from a diverse blend of human and scene videos. Comprehensive qualitative and quantitative evaluations demonstrate that X-Dyna outperforms state-of-the-art methods, creating highly lifelike and expressive animations. The code is available at https://github.com/bytedance/X-Dyna.

Paper Structure

This paper contains 21 sections, 11 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Sampled animations generated by our X-Dyna, including zero-shot motion transfer, and dynamic human image animation with moving or static humans.
  • Figure 2: We leverage a pretrained diffusion UNet backbone for controlled human image animation, enabling expressive dynamic details and precise motion control. Specifically, we introduce a Dynamics-Adapter $D$ that seamlessly integrates the reference image context as a trainable residual to the spatial attention, in parallel with the denoising process, while preserving the original spatial and temporal attention mechanisms within the UNet. In addition to body pose control via a ControlNet $C_P$, we introduce a local face control module $C_F$ that implicitly learns facial expression control from a synthesized cross-identity face patch. We train our model on a diverse dataset of human motion videos and natural scene videos simultaneously. Our model achieves remarkable transfer of body poses and facial expressions, as well as highly vivid and detailed dynamics for both the human and the scene.
  • Figure 3: a) IP-Adapter [ye2023ip] can generate vivid texture from the reference image but fails to preserve the appearance. b) Though ReferenceNet [hu2024animate] can preserve the identity from the human reference, it generates a static background without any dynamics. c) Dynamics-Adapter provides both expressive details and consistent identities.
  • Figure 4: a) IP-Adapter [ye2023ip] encodes the reference image as a CLIP image embedding and injects the information into the cross-attention layers in SD as a residual. b) ReferenceNet [hu2024animate] is a trainable parallel UNet that feeds semantic information into SD via concatenation of self-attention features. c) Dynamics-Adapter encodes the reference image with a partially shared-weight UNet. The appearance control is realized by learning a residual in the self-attention with trainable query and output linear layers. All other components share the same frozen weights with SD.
  • Figure 5: Qualitative Comparison on Human in Dynamic Scene. While existing SOTA methods struggle to generate consistent and realistic scene dynamics involving humans, our method successfully produces dynamic human-scene interactions while preserving the structure of the reference image.
  • ...and 1 more figure
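
The Figure 4(c) caption describes appearance control as a learned residual in the self-attention, with only a query and an output linear layer being trainable. The following is a minimal NumPy sketch of that idea, not the authors' implementation. Everything named here is an assumption made for illustration: the class `SelfAttnWithAppearanceResidual`, the single-head attention, the token shapes, the choice to let the denoising tokens query the reference tokens' keys and values, and the zero initialization of the trainable output layer.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Single-head scaled dot-product attention.
    # q: (n, d), k and v: (m, d) -> output: (n, d)
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v


class SelfAttnWithAppearanceResidual:
    """Hypothetical self-attention layer with an appearance residual."""

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        # Stand-ins for the frozen, pretrained projections shared with SD.
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        self.w_o = rng.normal(0.0, scale, (dim, dim))
        # The only trainable parts in this sketch: a query and an output layer.
        self.w_q_ref = rng.normal(0.0, scale, (dim, dim))
        # Assumption: zero init, so the layer starts as the unmodified backbone.
        self.w_o_ref = np.zeros((dim, dim))

    def __call__(self, x, x_ref):
        # x: denoising tokens (n, dim); x_ref: reference-image tokens (m, dim).
        base = attention(x @ self.w_q, x @ self.w_k, x @ self.w_v) @ self.w_o
        # Reference keys and values reuse the frozen projections.
        k_ref, v_ref = x_ref @ self.w_k, x_ref @ self.w_v
        # Denoising tokens query the reference through the trainable query.
        q_ref = x @ self.w_q_ref
        residual = attention(q_ref, k_ref, v_ref) @ self.w_o_ref
        return base + residual


rng = np.random.default_rng(0)
layer = SelfAttnWithAppearanceResidual(dim=8, rng=rng)
x = rng.normal(size=(5, 8))       # 5 denoising tokens
x_ref = rng.normal(size=(7, 8))   # 7 reference tokens
out = layer(x, x_ref)
print(out.shape)
```

Under the zero-initialization assumption, the residual branch contributes nothing at the start of training, so the frozen spatial attention path is left untouched until the adapter learns to add appearance information. This is one way to read the caption's claim that the original attention mechanisms are preserved.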