Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

Zichen Geng; Caren Han; Zeeshan Hayder; Jian Liu; Mubarak Shah; Ajmal Mian

TL;DR

KeyMotion is a keyframe-first framework for text-driven 3D human motion generation. It combines three components: a KL-regularized VAE that maps keyframes to a latent space, a Parallel Skip Transformer that performs text-conditioned latent diffusion, and a Motion Masked AutoEncoder that fills in the in-between frames while respecting the physical constraints of human motion. The method achieves state-of-the-art results on HumanML3D (all R-precision metrics and MultiModal Distance) and competitive results on KIT-ML (best Top-3 R-precision, FID, and Diversity), with real-time inference. By working on keyframe-level latents rather than full sequences, KeyMotion reduces computation while preserving fidelity, which makes it a candidate for text-guided animation and robotics applications.

Abstract

Text-driven human motion generation is an emerging task in animation and humanoid robot design. Existing algorithms directly generate the full sequence, which is computationally expensive and prone to errors because it pays no special attention to key poses, even though keyframing has been the cornerstone of animation for decades. We propose KeyMotion, which generates plausible human motion sequences corresponding to input text by first generating keyframes and then in-filling the remaining frames. We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the keyframes into a latent space, which reduces dimensionality and accelerates the subsequent diffusion process. For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the keyframe latents and the text condition. To complete the motion sequence, we propose a text-guided Transformer designed to perform motion in-filling while preserving fidelity and adhering to the physical constraints of human motion. Experiments show that our method achieves state-of-the-art results on the HumanML3D dataset, outperforming others on all R-precision metrics and MultiModal Distance. KeyMotion also achieves competitive performance on the KIT dataset, with the best results on the Top-3 R-precision, FID, and Diversity metrics.
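As an illustration of the Kullback-Leibler regularization mentioned in the abstract, the following is a minimal sketch of the standard closed-form KL term between a diagonal Gaussian posterior and a standard normal prior. It is not the paper's implementation: the function name `kl_to_standard_normal`, the use of a log-variance parameterization, and the absence of any loss weighting are assumptions made for this example.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions.

    `mu` and `logvar` are hypothetical per-dimension encoder outputs for one
    keyframe latent; the names are illustrative, not taken from the paper.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

# A posterior that already matches the prior incurs no penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Shifting one mean by 1 adds 0.5 * 1^2 to the penalty.
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```

In a VAE this term is typically added to the reconstruction loss with a small weight, so the latent space stays close to a Gaussian that a diffusion process can then noise and denoise.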

Paper Structure (26 sections, 14 equations, 8 figures, 12 tables)

Introduction
Related Works
KeyMotion Method
Keyframe Selection
Keyframe VAE (Variational Autoencoder)
Parallel Skip Transformer Denoiser
Motion Masked AutoEncoder
Experiments
Implementation Details
Text-to-Motion Generation Results
Qualitative Comparison
Comparison with Multi-Stage Reconstruction
Ablation Studies
Conclusion
Ablation Study on Keyframe Selection Method
...and 11 more sections

Figures (8)

Figure 1: Keyframes generated by our method for the input texts: (A) a person walks forward carefully placing one foot directly in front of the other; (B) a person walks forward, bends forward, walks backward; (C) a man takes his hands puts them on his hips and jumps up and down; (D) a person walks forward and then sits.
Figure 2: KeyMotion illustration. (1) Keyframe VAE encodes ($\mathcal{E}$) / decodes ($\mathcal{D}$) the keyframes in latent space. Keyframes in latent space are diffused to Gaussian noise. (2) Parallel Skip Transformer performs reverse diffusion in latent space using input text as conditioning. Denoised latent keyframes are decoded back to human motion space. (3) MMAE, a text-guided Transformer, performs keyframe infilling to produce the full sequence.
Figure 3: Parallel Skip Transformer and the two-stream cross-attention Transformer layer of the denoiser module. By applying cross-attention between the text condition and the latent variables, our model learns more stable textual information. $T$ is the timestep of the diffusion process, TRM is the basic Transformer, and $H_c^{(i)}$ and $H_l^{(i)}$ are the condition and latent hidden states after the $i$-th cross-attention Transformer layer.
Figure 4: Comparison of KeyMotion with other multi-stage models. (A) A person is slowly tip-toeing down a path while stretching his arms to balance himself. (B) The person was flying around like a fly. (C) The figure throws the basketball and then catches it. (D) A man sits on the chair clapping. (E) A man is walking and keeps jumping to avoid something on the ground.
Figure 5: FID over epochs
...and 3 more figures
