UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer

Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang

TL;DR

The paper tackles the challenge of generating temporally coherent, high-fidelity human animations guided by driving poses. It introduces UniAnimate-DiT, a Wan2.1-based diffusion-transformer framework that uses LoRA fine-tuning plus lightweight pose encoders to condition on motion and reference pose/appearance. Key contributions include memory-efficient fine-tuning, pose-aware conditioning, and patch-level integration of reference information, enabling 480p training with 720p inference and long-video generation through sliding windows. Qualitative results show strong visual fidelity and temporal consistency, and the authors provide open-source code for broader adoption and development.
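The sliding-window strategy mentioned above can be illustrated with a small sketch that splits a long driving-pose sequence into overlapping frame ranges. The function name `sliding_windows` and the window length and overlap used below are hypothetical choices for illustration; the summary does not state the values the authors use, nor how overlapping frames are merged.

```python
def sliding_windows(num_frames, window, overlap):
    """Return (start, end) frame ranges, end exclusive, that cover num_frames.

    Illustrative only: window and overlap are assumptions, not values
    reported for UniAnimate-DiT.
    """
    if num_frames <= 0 or window <= 0 or not 0 <= overlap < window:
        raise ValueError("need num_frames > 0 and 0 <= overlap < window")
    if num_frames <= window:
        # A short clip fits in a single window.
        return [(0, num_frames)]
    stride = window - overlap
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        # Add a final window aligned to the end so no frame is left out.
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


if __name__ == "__main__":
    # Hypothetical setting: 200 pose frames, 81-frame windows, 16-frame overlap.
    for start, end in sliding_windows(200, window=81, overlap=16):
        print(start, end)
```

Each window would be generated separately, with the frames shared by neighbouring windows reconciled afterwards (for example by averaging); that merging step is outside this sketch.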

Abstract

This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we apply the Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appealing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities and upscales seamlessly to 720p (1280x720) during inference. The training and inference code is publicly available at https://github.com/ali-vilab/UniAnimate-DiT.
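To make the memory argument for LoRA concrete, the following is a minimal NumPy sketch of a low-rank adapter around one frozen linear layer. It is not the authors' implementation: the class name `LoRALinear`, the layer sizes, the rank, and the scaling factor `alpha` are all assumptions chosen for illustration. The pretrained weight stays fixed, and only the two small factors `A` and `B` would be trained.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative)."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                      # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))          # trainable, zero at start
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x W^T + (alpha / r) * x A^T B^T
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_params(self):
        return self.A.size + self.B.size


if __name__ == "__main__":
    # Hypothetical sizes: a 64x32 layer adapted with rank 4.
    weight = np.ones((64, 32))
    layer = LoRALinear(weight, rank=4, alpha=8)
    print("full parameters:", weight.size)
    print("LoRA parameters:", layer.trainable_params())
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, which is one way the base model's generative behaviour is preserved at the start of fine-tuning. The trainable parameter count is `rank * (d_in + d_out)` rather than `d_in * d_out`.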

Paper Structure

This paper contains 6 sections and 3 figures.

Figures (3)

  • Figure 1: Image animation examples synthesized by the proposed UniAnimate-DiT with Wan2.1-I2V-14B [wang2025wanvideo] as the base model.
  • Figure 3: Video cases synthesized by the proposed UniAnimate-DiT.
  • Figure 4: Video cases synthesized by the proposed UniAnimate-DiT.