T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations
Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen
TL;DR
This paper tackles text-driven human motion generation with a two-stage framework: a Motion VQ-VAE that learns a discrete motion representation, and a GPT-based prior (T2M-GPT) that autoregressively generates the discrete codes from textual descriptions. It addresses codebook collapse with EMA and Code Reset, and mitigates the training-testing discrepancy in the GPT by corrupting training sequences. Conditioned on CLIP text embeddings and using an End token to signal the end of a motion, the method achieves results competitive with or superior to diffusion-based models on HumanML3D and KIT-ML, most notably in FID on HumanML3D. The study also analyzes quantization strategies and data requirements, showing that larger datasets could further improve results, and confirms that discrete representations remain a viable, simpler alternative for motion generation.
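To make the idea of corrupting training sequences concrete, here is a minimal sketch. It assumes that corruption means replacing each ground-truth code index with a uniformly random one with some probability; the function name `corrupt_indices`, the parameter `tau`, and the codebook size of 512 are illustrative assumptions, not details taken from the summary above.

```python
import random


def corrupt_indices(indices, codebook_size, tau, rng=None):
    """Return a copy of `indices` in which each entry is replaced, with
    probability `tau`, by a uniformly random code index.

    Assumption: this per-token replacement rule is one plausible reading of
    "corrupting training sequences"; names and defaults are hypothetical.
    """
    if rng is None:
        rng = random.Random()
    return [
        rng.randrange(codebook_size) if rng.random() < tau else idx
        for idx in indices
    ]


# Ground-truth code indices for one motion, as produced by the VQ-VAE encoder.
clean = [3, 17, 17, 42, 8, 0, 5, 99]

# The GPT would be fed the corrupted sequence as input while still being
# trained to predict the clean sequence.
noisy = corrupt_indices(clean, codebook_size=512, tau=0.5, rng=random.Random(0))
```

The intent is that the model sees imperfect prefixes during training, much like the imperfect prefixes it produces itself at inference time, when each predicted code is fed back as input.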
Abstract
In this work, we investigate a simple and well-known conditional generative framework based on Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT shows better performance than competitive approaches, including recent diffusion-based approaches. For example, on HumanML3D, which is currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), but with an FID of 0.116, substantially better than MotionDiffuse's 0.630. Additionally, we conduct analyses on HumanML3D and observe that the dataset size is a limitation of our approach. Our work suggests that VQ-VAE remains a competitive approach for human motion generation.
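The two codebook training recipes named above, EMA and Code Reset, can be sketched in a few lines of plain Python. This is a generic illustration under stated assumptions, not the authors' implementation: the function names, the `decay` and `min_usage` values, and the rule of re-seeding an unused code from a random encoder output in the batch are all hypothetical choices.

```python
import random


def nearest_code(codebook, z):
    """Index of the codebook entry closest to z (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - x) ** 2 for c, x in zip(codebook[k], z)),
    )


def ema_step(codebook, ema_count, ema_sum, latents,
             decay=0.99, min_usage=1e-3, rng=None):
    """One in-place codebook update: EMA for used codes, reset for dead ones."""
    if rng is None:
        rng = random.Random()
    K, D = len(codebook), len(codebook[0])

    # Per-code statistics of the current batch of encoder outputs.
    batch_count = [0] * K
    batch_sum = [[0.0] * D for _ in range(K)]
    for z in latents:
        k = nearest_code(codebook, z)
        batch_count[k] += 1
        for d in range(D):
            batch_sum[k][d] += z[d]

    for k in range(K):
        # EMA: blend running statistics with the batch statistics.
        ema_count[k] = decay * ema_count[k] + (1 - decay) * batch_count[k]
        for d in range(D):
            ema_sum[k][d] = decay * ema_sum[k][d] + (1 - decay) * batch_sum[k][d]

        if ema_count[k] >= min_usage:
            codebook[k] = [s / ema_count[k] for s in ema_sum[k]]
        else:
            # Code Reset (assumed rule): re-seed a dead code from the batch.
            z = list(rng.choice(latents))
            codebook[k] = z
            ema_count[k] = 1.0
            ema_sum[k] = list(z)
    return codebook


# Toy example: code 1 sits far from the data and is never selected.
codebook = [[0.0, 0.0], [100.0, 100.0]]
ema_count = [1.0, 0.0]
ema_sum = [[0.0, 0.0], [0.0, 0.0]]
latents = [[1.0, 1.0], [3.0, 3.0]]
ema_step(codebook, ema_count, ema_sum, latents, decay=0.5, rng=random.Random(0))
```

In the toy example, code 0 moves toward the data through the running average, while the unused code 1 is re-seeded from the batch so that it can attract assignments in later steps. Keeping codes in use this way is how such recipes counter codebook collapse.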
