HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation

Qijun Gan, Yi Ren, Chen Zhang, Zhenhui Ye, Pan Xie, Xiang Yin, Zehuan Yuan, Bingyue Peng, Jianke Zhu

TL;DR

HumanDiT addresses the challenge of generating high-fidelity, long-form human motion videos with accurate hands and faces and strong temporal coherence. It introduces a Diffusion Transformer-based framework with a prefix-latent reference strategy, RoPE-enabled variable resolutions, and a Pose Guidance module to convert pose information into tokens, complemented by Keypoint-DiT and a pose adapter for video continuation and pose transfer. The approach is trained on a large-scale 14,000-hour in-the-wild dataset and demonstrates superior quantitative and qualitative performance across diverse scenarios, with ablations confirming the importance of token size, reference strategy, and pose refinement. This work advances flexible, high-quality human video generation with practical implications for virtual humans, animation, and long-form video synthesis, while acknowledging computational demands and current limits in extreme poses.

Abstract

Human motion video generation has advanced significantly, but existing methods still struggle to accurately render detailed body parts such as hands and faces, especially in long sequences and intricate motions. Current approaches also rely on a fixed resolution and struggle to maintain visual consistency. To address these limitations, we propose HumanDiT, a pose-guided Diffusion Transformer (DiT)-based framework trained on a large in-the-wild dataset containing 14,000 hours of high-quality video to produce high-fidelity videos with fine-grained body rendering. Specifically, (i) HumanDiT, built on DiT, supports numerous video resolutions and variable sequence lengths, facilitating learning for long-sequence video generation; (ii) we introduce a prefix-latent reference strategy to maintain personalized characteristics across extended sequences. Furthermore, during inference, HumanDiT leverages Keypoint-DiT to generate subsequent pose sequences, facilitating video continuation from static images or existing videos. It also utilizes a Pose Adapter to enable pose transfer with given sequences. Extensive experiments demonstrate its superior performance in generating long-form, pose-accurate videos across diverse scenarios.
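As a rough illustration of the prefix-latent reference idea, the sketch below keeps the first latent frame noise-free while noising the remaining frames, so that full attention over the sequence could read identity cues from the clean prefix. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `add_noise_with_prefix`, the `(T, C, H, W)` latent layout, the DDPM-style noising formula, and the loss mask that excludes the prefix are all hypothetical choices made for illustration.

```python
import numpy as np

def add_noise_with_prefix(latents, alpha_bar, rng, prefix_len=1):
    """Noise every latent frame except the first `prefix_len` reference frames.

    latents:   clean video latents, shape (T, C, H, W) (assumed layout).
    alpha_bar: cumulative signal fraction in (0, 1) for the sampled timestep
               (assumed DDPM-style schedule; the paper's schedule may differ).
    Returns (noisy_latents, noise, loss_mask).
    """
    noise = rng.standard_normal(latents.shape)
    noisy = np.sqrt(alpha_bar) * latents + np.sqrt(1.0 - alpha_bar) * noise
    # Keep the prefix frames clean so they can act as a reference.
    noisy[:prefix_len] = latents[:prefix_len]
    # Assumption: only the noised frames contribute to the denoising loss.
    loss_mask = np.ones(latents.shape[0], dtype=bool)
    loss_mask[:prefix_len] = False
    return noisy, noise, loss_mask

rng = np.random.default_rng(0)
latents = rng.standard_normal((5, 4, 8, 8))  # 5 latent frames, toy sizes
noisy, noise, loss_mask = add_noise_with_prefix(latents, alpha_bar=0.5, rng=rng)
```

In this sketch the reference needs no separate encoder branch: it simply occupies the first positions of the token sequence, which is one plausible reading of how a prefix latent could carry appearance across long sequences.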

Paper Structure

This paper contains 24 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: HumanDiT is a framework designed to generate high-fidelity, long human motion videos in diverse scenes and at flexible resolutions.
  • Figure 2: Overview of HumanDiT. HumanDiT focuses on generating videos from a single image using a pose-guided DiT model. A 3D VAE is employed to encode video segments into latent space. With 3D full attention, the initial frame (green border) serves as a noise-free prefix latent (green cube) for reference. The pose guider extracts body and background pose features, while the DiT-based denoising model renders the final pixel results. During inference, the Keypoint-DiT model produces subsequent motions based on the pose of the first frame. With a guided pose sequence, the pose adapter transfers and refines poses via Keypoint-DiT to animate the reference image.
  • Figure 3: Qualitative comparison. Our approach outperforms others in rendering quality and pose accuracy.
  • Figure 4: Template pose-driven human rendering results of HumanDiT on Flux-generated images.
  • Figure 5: Video continuation with generated motions.
  • ...and 3 more figures