T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations

Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen

TL;DR

This paper tackles text-driven human motion generation with a two-stage framework: a Motion VQ-VAE learns a discrete motion representation, and a GPT-based prior (T2M-GPT) autoregressively generates the discrete codes from textual descriptions. It addresses codebook collapse with EMA and Code Reset, and mitigates the train-test discrepancy in the GPT by corrupting training sequences. Conditioned on CLIP text embeddings and using an End token to signal the end of a motion, the method achieves results competitive with or superior to diffusion-based models on HumanML3D and KIT-ML, notably an FID of 0.116 on HumanML3D against 0.630 for MotionDiffuse. The study also analyzes quantization strategies and data requirements, showing that larger datasets could further improve results, while confirming that discrete representations remain a viable, simpler alternative for motion generation.
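The corruption of training sequences can be pictured with a minimal sketch. It assumes that corruption means replacing each ground-truth code index, with some probability `tau`, by a random codebook index; the function name `corrupt_codes`, the parameter `tau`, and the codebook size in the usage lines are illustrative assumptions, not details taken from the summary above.

```python
import random


def corrupt_codes(codes, codebook_size, tau, rng):
    """Return a copy of `codes` with some indices randomly replaced.

    Assumption: each index is swapped for a uniformly random codebook
    index with probability `tau`. The summary only states that training
    sequences are corrupted; this rule is one plausible reading.
    """
    return [
        rng.randrange(codebook_size) if rng.random() < tau else code
        for code in codes
    ]


# Usage: corrupt the input indices while the prediction targets stay clean.
rng = random.Random(0)
clean = [5, 7, 9, 11]
noisy = corrupt_codes(clean, codebook_size=512, tau=0.5, rng=rng)
```

Feeding `noisy` as the input while still predicting `clean` exposes the model to imperfect prefixes during training, which is closer to what it sees at inference when it conditions on its own earlier predictions.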

Abstract

In this work, we investigate a simple and well-known conditional generative framework based on Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during training to alleviate the discrepancy between training and testing. Despite its simplicity, our T2M-GPT shows better performance than competitive approaches, including recent diffusion-based approaches. For example, on HumanML3D, which is currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), but with an FID of 0.116, substantially better than MotionDiffuse's 0.630. Additionally, we conduct analyses on HumanML3D and observe that the dataset size is a limitation of our approach. Our work suggests that VQ-VAE remains a competitive approach for human motion generation.
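The two codebook recipes named in the abstract, EMA and Code Reset, can be sketched in a few lines. This is a simplified illustration in plain Python, not the authors' implementation: it assumes that EMA moves each used code toward the mean of the encoder outputs assigned to it, and that Code Reset re-initializes codes that received no assignments from randomly chosen encoder outputs. The function names and the `decay` value are hypothetical.

```python
import random


def nearest_code(vec, codebook):
    """Index of the codebook entry closest to `vec` (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def ema_update(codebook, latents, decay=0.99):
    """Update the codebook in place and return per-code usage counts.

    Assumption: each used code becomes decay * old + (1 - decay) * mean
    of the latents assigned to it; unused codes are left unchanged.
    """
    dim = len(codebook[0])
    counts = [0] * len(codebook)
    sums = [[0.0] * dim for _ in codebook]
    for z in latents:
        k = nearest_code(z, codebook)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += z[d]
    for k, n in enumerate(counts):
        if n > 0:
            mean = [s / n for s in sums[k]]
            codebook[k] = [
                decay * c + (1 - decay) * m for c, m in zip(codebook[k], mean)
            ]
    return counts


def reset_unused(codebook, latents, counts, rng):
    """Code Reset (assumed form): reassign unused codes to random latents."""
    for k, n in enumerate(counts):
        if n == 0:
            codebook[k] = list(rng.choice(latents))


# Usage with a toy 2-entry codebook in 2 dimensions.
codebook = [[0.0, 0.0], [10.0, 10.0]]
latents = [[1.0, 1.0], [1.0, 1.0]]
counts = ema_update(codebook, latents, decay=0.5)
reset_unused(codebook, latents, counts, random.Random(0))
```

In the toy run the second code is never selected, so the reset step moves it onto the data; without such a step, unused codes stay unused, which is the codebook collapse that the recipes are meant to prevent.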

Paper Structure (40 sections, 9 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Visual results on HumanML3D [guo2022generating]. Our approach is able to generate precise and high-quality human motion consistent with challenging text descriptions. More visual results are on the project page: https://mael-zys.github.io/T2M-GPT/.
  • Figure 2: Overview of our framework for text-driven motion generation. It includes two modules: Motion VQ-VAE and T2M-GPT. In T2M-GPT, an additional learnable $\mathit{End}$ token is inserted to indicate the stop of the generation. During inference, we first generate code indices in an auto-regressive fashion and then obtain the motion using the decoder in Motion VQ-VAE.
  • Figure 3: Architecture of the motion VQ-VAE. We use a standard CNN-based architecture with 1D convolution (Conv1D), residual block (ResBlock) and ReLU activation. '$L$' denotes the number of residual blocks. We use convolution with stride 2 for temporal downsampling and nearest-neighbor interpolation for temporal upsampling.
  • Figure 4: Visual results on the HumanML3D [guo2022generating] dataset. We compare our generation with Guo et al. [guo2022generating], MotionDiffuse [zhang2022motiondiffuse], and MDM [tevet2022MDM]. Distorted motions (red) and sliding (yellow) are highlighted. More visual results can be found on the project page: https://mael-zys.github.io/T2M-GPT/.
  • Figure 5: Impact of dataset size on HumanML3D [guo2022generating]. We train our motion VQ-VAE (Reconstruction) and T2M-GPT (Generation) on subsets of HumanML3D [guo2022generating] comprising 10%, 20%, 50%, 80%, and 100% of the training set, respectively. All the models are evaluated on the entire test set. We report FID, MM-Dist, Top-1, and Top-3 accuracy for all the models. Results suggest that our model might benefit from more training data.
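
The inference procedure described in the Figure 2 caption (generate code indices one at a time, stop at the End token, then decode) can be sketched as a short loop. Everything named here is an assumption for illustration: `next_code` is a hypothetical stand-in for one sampling step of the transformer, `END` is an arbitrary index chosen for the End token, and `max_len` is a safety cap that the caption does not mention.

```python
def generate_codes(next_code, text_embedding, end_token, max_len=50):
    """Autoregressively collect code indices until the End token appears.

    `next_code(text_embedding, prefix)` is a hypothetical callable that
    returns the next index given the text condition and the codes
    generated so far.
    """
    codes = []
    for _ in range(max_len):
        token = next_code(text_embedding, codes)
        if token == end_token:
            break  # the End token marks the stop of the generation
        codes.append(token)
    return codes


# Toy stand-in for the trained model: emits a fixed script, then End.
END = 512  # hypothetical index reserved for the End token


def toy_next_code(text_embedding, prefix):
    script = [3, 1, 4, END]
    return script[len(prefix)]


codes = generate_codes(toy_next_code, text_embedding=None, end_token=END)
```

In the full pipeline the resulting indices would be looked up in the codebook and passed through the Motion VQ-VAE decoder to obtain the motion; that step is omitted here because it depends on the trained model.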