DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

Yuxuan Luo, Zhengkun Rong, Lizhen Wang, Longhao Zhang, Tianshu Hu, Yongming Zhu

TL;DR

DreamActor-M1 introduces a diffusion-transformer framework with hybrid motion guidance (implicit facial representations, 3D head spheres, 3D body skeletons), complementary appearance guidance, and progressive training to achieve fine-grained, multi-scale, and temporally coherent human image animation. By decoupling facial expressions, head pose, and body motion and by incorporating pseudo-references for unseen regions, it delivers robust long-term consistency from portraits to full-body scenes. The three-stage training regime and a diverse 500-hour dataset enable effective generalization, and ablations validate the importance of each component. The approach advances expressive, identity-preserving animation with improved robustness for real-world deployment. The authors acknowledge limitations in handling camera movement and object interactions, and they address ethical concerns.

Abstract

While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, which leads to their lower expressiveness and robustness. We propose a diffusion transformer (DiT) based framework, DreamActor-M1, with hybrid guidance to overcome these limitations. For motion guidance, our hybrid control signals that integrate implicit facial representations, 3D head spheres, and 3D body skeletons achieve robust control of facial expressions and body movements, while producing expressive and identity-preserving animations. For scale adaptation, to handle various body poses and image scales ranging from portraits to full-body views, we employ a progressive training strategy using data with varying resolutions and scales. For appearance guidance, we integrate motion patterns from sequential frames with complementary visual references, ensuring long-term temporal coherence for unseen regions during complex movements. Experiments demonstrate that our method outperforms the state-of-the-art works, delivering expressive results for portraits, upper-body, and full-body generation with robust long-term consistency. Project Page: https://grisoon.github.io/DreamActor-M1/.

Paper Structure

This paper contains 14 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: We introduce DreamActor-M1, a DiT-based human animation framework, with hybrid guidance to achieve fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence.
  • Figure 2: Overview of DreamActor-M1. During the training stage, we first extract body skeletons and head spheres from driving frames and then encode them to the pose latent using the pose encoder. The resultant pose latent is combined with the noised video latent along the channel dimension. The video latent is obtained by encoding a clip from the input full video using 3D VAE. Facial expression is additionally encoded by the face motion encoder, to generate implicit facial representations. Note that the reference image can be one or multiple frames sampled from the input video to provide additional appearance details during training and the reference token branch shares weights of our DiT model with the noise token branch. Finally, the denoised video latent is supervised by the encoded video latent. Within each DiT block, the face motion token is integrated into the noise token branch via cross-attention (Face Attn), while appearance information of ref token is injected to noise token through concatenated self-attention (Self Attn) and subsequent cross-attention (Ref Attn).
  • Figure 3: Overview of our inference pipeline. First, we (optionally) generate multiple pseudo-references to provide complementary appearance information. Next, we extract hybrid control signals comprising implicit facial motion and explicit poses (head sphere and body skeleton) from the driving video. Finally, these signals are injected into a DiT model to synthesize animated human videos. Our framework decouples facial motion from body poses, with facial motion signals being alternatively derivable from speech inputs.
  • Figure 4: Comparisons with human image animation works: Animate Anyone [hu2024animate], Champ [zhu2024champ], MimicMotion [zhang2024mimicmotion], and DisPose [li2024dispose]. Our method demonstrates results with better fine-grained motions, identity preservation, temporal consistency, and high fidelity.
  • Figure 5: Comparisons with portrait image animation works: LivePortrait [guo2024liveportrait], X-Portrait [xie2024x], Skyreels-A1 [qiu2025skyreels], and Runway Act-One [runwayactone]. Our method demonstrates more accurate and expressive portrait animation capabilities.
  • ...and 1 more figures
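
The conditioning flow described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`concat_pose`, `dit_block`), the tensor shapes, the single-head attention without learned projections, the residual connections, and the placement of Face Attn after Ref Attn are all assumptions made for illustration. Only the three mechanisms come from the caption: channel-wise concatenation of the pose latent with the noised video latent, self-attention over concatenated noise and reference tokens, and cross-attention from noise tokens to reference and face motion tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    # Assumption: no learned projections, to keep the sketch short.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def concat_pose(noised_video_latent, pose_latent):
    # Combine the two latents along the channel dimension:
    # (C1, T, H, W) and (C2, T, H, W) -> (C1 + C2, T, H, W).
    return np.concatenate([noised_video_latent, pose_latent], axis=0)

def dit_block(noise_tok, ref_tok, face_tok):
    # Hypothetical block; the ordering of the three steps is an assumption.
    n = noise_tok.shape[0]

    # Self Attn: noise and reference tokens are concatenated and attend jointly.
    joint = np.concatenate([noise_tok, ref_tok], axis=0)
    joint = joint + attention(joint, joint, joint)
    noise_tok, ref_tok = joint[:n], joint[n:]

    # Ref Attn: noise tokens query the reference tokens.
    noise_tok = noise_tok + attention(noise_tok, ref_tok, ref_tok)

    # Face Attn: noise tokens query the implicit face motion tokens.
    noise_tok = noise_tok + attention(noise_tok, face_tok, face_tok)
    return noise_tok, ref_tok

# Toy shapes, chosen arbitrarily for the example.
rng = np.random.default_rng(0)
video_latent = rng.normal(size=(4, 2, 3, 3))   # (C, T, H, W)
pose_latent = rng.normal(size=(4, 2, 3, 3))
dit_input = concat_pose(video_latent, pose_latent)

noise_tok = rng.normal(size=(18, 16))          # T*H*W tokens, width 16
ref_tok = rng.normal(size=(9, 16))             # one reference frame
face_tok = rng.normal(size=(2, 16))            # one face token per frame
out_noise, out_ref = dit_block(noise_tok, ref_tok, face_tok)
```

In this sketch the reference tokens pass through the same self-attention call as the noise tokens, which loosely mirrors the caption's statement that the reference token branch shares weights with the noise token branch.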