HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions

Shuolin Xu, Siming Zheng, Ziyi Wang, HC Yu, Jinwei Chen, Huaqi Zhang, Bo Li, Peng-Tao Jiang

TL;DR

The paper tackles pose-guided human image animation under highly dynamic, non-standard motions (Hypermotion). It introduces the Open-HyperMotionX dataset and HyperMotionX Bench, which provide high-quality pose annotations and an evaluation benchmark for complex motion scenarios. It also presents a simple DiT-based baseline that uses Latent Composition for conditional control together with a Spatial Low-Frequency Enhanced RoPE (SLF-RoPE) module. SLF-RoPE strengthens low-frequency spatial modeling, which improves structural stability and appearance consistency in challenging sequences, while a wavelet-based temporal windowing method helps isolate representative action segments. Experimental results show consistent gains in pixel-level fidelity and competitive video-level metrics, positioning the dataset and method as practical tools for advancing complex motion generation.

Abstract

Recent advances in diffusion models have significantly improved conditional video generation, particularly in the pose-guided human image animation task. Although existing methods can generate high-fidelity, temporally consistent animation sequences for regular motions and static scenes, they still show clear limitations on complex human body motions (Hypermotion) that are highly dynamic and non-standard, and the field lacks a high-quality benchmark for evaluating complex human motion animation. To address these challenges, we introduce the Open-HyperMotionX Dataset and HyperMotionX Bench, which provide high-quality human pose annotations and curated video clips for evaluating and improving pose-guided human image animation models under complex human motion conditions. Furthermore, we propose a simple yet powerful DiT-based video generation baseline and design spatial low-frequency enhanced RoPE, a novel module that selectively enhances low-frequency spatial feature modeling by introducing learnable frequency scaling. Our method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences. Extensive experiments demonstrate the effectiveness of our dataset and proposed approach in advancing the generation quality of complex human motion image animations. Code and dataset will be made publicly available.
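The abstract describes spatial low-frequency enhanced RoPE only at a high level (learnable frequency scaling applied selectively to low-frequency spatial components). The following is a minimal, illustrative sketch of that general idea for a single 1D position axis, not the paper's formulation. The function names, the `scale` value (a fixed stand-in for a learnable parameter), and the choice of which bands count as "low-frequency" (`num_low`) are all assumptions made for illustration.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Standard RoPE band frequencies: base^(-2i/dim) for i = 0 .. dim/2 - 1."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def scale_low_frequencies(freqs, scale, num_low):
    """Multiply the `num_low` lowest-frequency bands by `scale`.

    Assumption: `scale` stands in for a learnable parameter, and the
    lowest-frequency bands are the last entries, because the standard
    RoPE frequencies decrease with the band index.
    """
    out = list(freqs)
    for i in range(len(out) - num_low, len(out)):
        out[i] *= scale
    return out


def apply_rope(x, pos, freqs):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by the angle pos * freqs[i]."""
    out = []
    for i, f in enumerate(freqs):
        angle = pos * f
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


# Usage: scale only the single lowest-frequency band of an 8-dim feature.
freqs = rope_frequencies(8)
slf_freqs = scale_low_frequencies(freqs, scale=1.5, num_low=1)
rotated = apply_rope([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0], pos=3, freqs=slf_freqs)
```

Because each band is still a pure rotation, rescaling a band's frequency changes how quickly that band varies with position but keeps the two RoPE properties that attention relies on: vector norms are preserved, and the dot product between a rotated query and key depends only on their relative position.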

Paper Structure

This paper contains 23 sections, 9 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Complex human motion animation samples generated by our method. We present generation examples in both landscape ($1024 \times 576$) and portrait ($576 \times 1024$) resolutions.
  • Figure 2: Sample video frames of previous methods. Comparison under separate high-quality and low-quality pose guidance.
  • Figure 3: The overview of our Hypermotion framework. The model takes a reference image and a driving pose video as inputs and generates human animation. Pose control and reference image are injected via latent composition and guided by a binary mask. Spatial Low-Frequency Enhanced RoPE is applied in self-attention.
  • Figure 4: Examples from our HyperMotionX Bench. We contribute high-quality pose sequence annotations for a diverse set of complex motion videos and character types, including adults and children, across video styles covering both real and cartoon scenes.
  • Figure 5: Qualitative comparison between our method and previous state-of-the-art methods. Our method demonstrates superior structural coherence, appearance consistency, and motion stability under complex human motion such as a front flip. Results are shown in both landscape ($1024 \times 576$) and portrait ($576 \times 1024$) resolutions.
  • ...and 2 more figures
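
The Figure 3 caption states that pose control and the reference image are injected via latent composition guided by a binary mask, without giving the layout. As a rough sketch of what mask-guided composition can look like, the toy example below builds one conditioning vector per frame. Everything about the layout is an assumption for illustration: the hypothetical `compose_latents` helper, the choice to place the reference latent in the first frame only, the zero placeholders for the other frames, and the flat per-frame lists standing in for real latent tensors.

```python
def compose_latents(ref_latent, pose_latents):
    """Build one conditioning vector per frame (hypothetical layout).

    Frame 0 carries the reference-image latent with mask 1.0; every
    other frame carries zeros of the same length with mask 0.0. The
    frame's pose latent and the mask value are concatenated after it.
    """
    zeros = [0.0] * len(ref_latent)
    composed = []
    for t, pose in enumerate(pose_latents):
        is_reference = t == 0
        image_part = ref_latent if is_reference else zeros
        mask = 1.0 if is_reference else 0.0
        composed.append(list(image_part) + list(pose) + [mask])
    return composed


# Usage: a 2-channel reference latent and three frames of 2-channel pose latents.
ref = [0.3, -0.7]
poses = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
cond = compose_latents(ref, poses)
```

In this layout the binary mask tells the model which frames contain a real image latent and which contain placeholders to be generated, while the pose latent is available for every frame.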