Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades

Ashkan Taghipour; Morteza Ghahremani; Zinuo Li; Hamid Laga; Farid Boussaid; Mohammed Bennamoun

TL;DR

The paper introduces a Blender-based synthetic dataset of 2,000 videos in which diverse characters perform acrobatic and stunt-like motions. It fills an important gap: existing benchmarks significantly under-represent acrobatic motions, and web-collected datasets raise copyright and privacy concerns.

Abstract

Generating videos of complex human motions such as flips, cartwheels, and martial arts remains challenging for current video diffusion models. Text-only conditioning is temporally ambiguous for fine-grained motion control, while explicit pose-based controls, though effective, require users to provide complete skeleton sequences that are costly to produce for long and dynamic actions. We propose a two-stage cascaded framework that addresses both limitations. First, an autoregressive text-to-skeleton model generates 2D pose sequences from natural language descriptions by predicting each joint conditioned on previously generated poses. This design captures long-range temporal dependencies and inter-joint coordination required for complex motions. Second, a pose-conditioned video diffusion model synthesizes videos from a reference image and the generated skeleton sequence. It employs DINO-ALF (Adaptive Layer Fusion), a multi-level reference encoder that preserves appearance and clothing details under large pose changes and self-occlusions. To address the lack of publicly available datasets for complex human motion video generation, we introduce a Blender-based synthetic dataset containing 2,000 videos with diverse characters performing acrobatic and stunt-like motions. The dataset provides full control over appearance, motion, and environment. It fills an important gap because existing benchmarks significantly under-represent acrobatic motions while web-collected datasets raise copyright and privacy concerns. Experiments on our synthetic dataset and the Motion-X Fitness benchmark show that our text-to-skeleton model outperforms prior methods on FID, R-precision, and motion diversity. Our pose-to-video model also achieves the best results among all compared methods on VBench metrics for temporal consistency, motion smoothness, and subject preservation.
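
The first stage described in the abstract can be illustrated with a minimal sketch of joint-by-joint autoregressive decoding. Everything concrete here is an assumption made for illustration and is not taken from the paper: the joint count, the vocabulary size, one token per joint, greedy decoding, and the `toy_scores` stand-in that replaces a trained Transformer decoder.

```python
import random
from typing import Callable, List, Sequence

NUM_JOINTS = 17    # assumed joint count; the source does not state one
VOCAB_SIZE = 256   # assumed size of the quantized pose-token vocabulary

# A "model" is any callable that maps the context so far to one
# unnormalized score per vocabulary entry.
NextTokenScores = Callable[[Sequence[int]], List[float]]


def toy_scores(context: Sequence[int]) -> List[float]:
    """Hypothetical stand-in for a trained decoder (deterministic noise)."""
    rng = random.Random(hash(tuple(context)))
    return [rng.random() for _ in range(VOCAB_SIZE)]


def generate_pose_tokens(
    text_prefix: Sequence[int],
    num_frames: int,
    model: NextTokenScores = toy_scores,
) -> List[List[int]]:
    """Greedily decode one token per joint, frame after frame.

    The text tokens act as a conditioning prefix; every new joint token is
    predicted from the prefix plus all previously generated joint tokens.
    """
    context = list(text_prefix)
    frames: List[List[int]] = []
    for _ in range(num_frames):
        frame: List[int] = []
        for _ in range(NUM_JOINTS):
            scores = model(context)
            token = max(range(VOCAB_SIZE), key=scores.__getitem__)
            frame.append(token)
            context.append(token)  # later joints see this prediction
        frames.append(frame)
    return frames


frames = generate_pose_tokens(text_prefix=[1, 2, 3], num_frames=4)
```

Because the context grows with every predicted token, each joint is conditioned both on the earlier joints of the same frame and on all earlier frames, which is the property the abstract credits for capturing temporal dependencies and inter-joint coordination. A real system would sample from the score distribution rather than decode greedily, and would map tokens back to 2D coordinates.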

Paper Structure (28 sections, 20 equations, 8 figures, 6 tables)

Introduction
Related Work
Pose-Conditioned Human Video Generation
Text-to-Skeleton as Motion Control for Video
Proposed Method
Text-to-Skeleton Generation
Pose Representation and Tokenization
Text conditioning
Autoregressive Decoder
Training objective
Inference
Pose-Conditioned Video Generation
Latent video diffusion backbone
DINO-ALF: Adaptive Layer Fusion for Appearance Encoding
Replacing native reference cross-attention with DINO cross-attention
...and 13 more sections

Figures (8)

Figure 1: Overview of the text-to-skeleton generation architecture for training. A text prompt is encoded and prepended as a conditioning prefix to the pose token sequence. The autoregressive Transformer predicts each joint token conditioned on all previously generated tokens and the text description.
Figure 2: Overview of the text-to-skeleton generation architecture for inference.
Figure 3: Overview of the pose-conditioned video generation architecture. A reference image is encoded via DINO-ALF to produce appearance tokens, while the skeleton sequence is rasterized and encoded by a 3D CNN into spatiotemporally aligned motion tokens. Both conditioning streams are injected into the DiT denoiser to synthesize the output video.
Figure 4: Patch-feature magnitude maps ($\ell_2$-norm) across DINOv3 layers. Earlier layers exhibit high activation on the subject and texture-rich regions, while later layers show more uniform magnitudes. This motivates using an early layer as the query for adaptive layer fusion.
Figure 5: Cross-attention maps for CLIP (top) vs. DINO-ALF (bottom) on a backflip sequence. DINO-ALF attends more precisely to the moving subject, while CLIP attention is scattered.
...and 3 more figures
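
The patch-feature magnitude maps mentioned in the Figure 4 caption can be sketched as a per-patch $\ell_2$-norm reshaped onto the patch grid. This is only an illustration under stated assumptions: the function name `patch_magnitude_map` is hypothetical, the features are plain Python lists in row-major patch order, and no actual DINOv3 model is involved.

```python
import math
from typing import List, Sequence


def patch_magnitude_map(
    patch_features: Sequence[Sequence[float]],
    grid_h: int,
    grid_w: int,
) -> List[List[float]]:
    """Return a grid_h x grid_w map of per-patch feature L2 norms.

    patch_features holds one feature vector per patch, in row-major order
    (an assumption about layout, not something stated in the source).
    """
    if len(patch_features) != grid_h * grid_w:
        raise ValueError("number of patches must equal grid_h * grid_w")
    norms = [math.sqrt(sum(v * v for v in feat)) for feat in patch_features]
    # Reshape the flat list of norms into rows of the patch grid.
    return [norms[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy 2x2 grid with 2-dimensional features.
toy_features = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
magnitude_map = patch_magnitude_map(toy_features, grid_h=2, grid_w=2)
```

Computing one such map per encoder layer and comparing them is a simple way to see which layers concentrate magnitude on the subject, which is the observation the caption uses to motivate an early layer as the query.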
