Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer
Zichen Geng, Caren Han, Zeeshan Hayder, Jian Liu, Mubarak Shah, Ajmal Mian
TL;DR
KeyMotion introduces a keyframe-first framework for text-driven 3D human motion generation. It combines a KL-regularized VAE that maps keyframes to a latent space, a Parallel Skip Transformer for text-conditioned latent diffusion, and a Motion Masked AutoEncoder that fills in the in-between frames while preserving physical constraints. The method achieves state-of-the-art results on HumanML3D (R-precision and MultiModal Distance) and competitive results on KIT-ML (Top-3 R-precision, FID, Diversity), with real-time inference. By exploiting keyframe-level representations and cross-modal diffusion, KeyMotion reduces computation while preserving fidelity, enabling scalable, text-guided animation and robotics applications.
Abstract
Text-driven human motion generation is an emerging task in animation and humanoid robot design. Existing algorithms directly generate the full sequence, which is computationally expensive and prone to errors because they pay no special attention to key poses, which have been the cornerstone of animation for decades. We propose KeyMotion, which generates plausible human motion sequences corresponding to input text by first generating keyframes and then in-filling the remaining frames. We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the keyframes into a latent space, which reduces dimensionality and accelerates the subsequent diffusion process. For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the keyframe latents and the text condition. To complete the motion sequence, we propose a text-guided Transformer designed to perform motion in-filling, preserving both fidelity and adherence to the physical constraints of human motion. Experiments show that our method achieves state-of-the-art results on the HumanML3D dataset, outperforming others on all R-precision metrics and MultiModal Distance. KeyMotion also achieves competitive performance on the KIT-ML dataset, with the best results on the Top-3 R-precision, FID, and Diversity metrics.
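
To make the keyframe and KL-regularization ideas above concrete, the following is a minimal sketch and not the authors' implementation. It assumes a diagonal-Gaussian VAE posterior with a standard normal prior (the usual VAE setup, which the abstract does not spell out) and a uniform-stride keyframe selection. The function names `select_keyframes` and `kl_to_standard_normal` and the `stride` parameter are hypothetical, introduced only for illustration.

```python
import math


def select_keyframes(num_frames, stride):
    """Hypothetical keyframe picker: every `stride`-th frame plus the last frame.

    The abstract does not state how keyframes are chosen; uniform striding is
    an assumption made only for this sketch.
    """
    if num_frames <= 0 or stride <= 0:
        raise ValueError("num_frames and stride must be positive")
    indices = list(range(0, num_frames, stride))
    if indices[-1] != num_frames - 1:
        indices.append(num_frames - 1)
    return indices


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.

    This is the standard closed-form KL term used to regularize a VAE latent
    space when the posterior is a diagonal Gaussian (assumed here).
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )
```

For a 10-frame clip with a stride of 4, `select_keyframes(10, 4)` keeps frames 0, 4, 8, and 9, so only those poses would be encoded and diffused; the remaining frames would be left to an in-filling model. The KL term is zero exactly when the posterior matches the prior, which is what pulls the keyframe latents toward a compact, well-behaved space.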
