T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences
Taeryung Lee, Fabien Baradel, Thomas Lucas, Kyoung Mu Lee, Gregory Rogez
TL;DR
This work tackles long-term 3D human motion generation from streams of natural language descriptions. It introduces T2LM, a non-recurrent framework that combines a 1D-convolutional VQ-VAE for discrete motion latent representations with a Transformer-based Text Encoder to map text to these latents, enabling training without sequential data. By decoding a continuous stream of latent vectors through a 1D convolutional decoder, T2LM achieves smooth transitions across actions and scales to long sequences. Empirical results on HumanML3D and BABEL show state-of-the-art performance for long-term generation and competitive results for single-action tasks, highlighting the practicality of continuous latent-based generation for text-conditioned motion synthesis.
Abstract
In this paper, we address the challenging problem of long-term 3D human motion generation. Specifically, we aim to generate a long sequence of smoothly connected actions from a stream of multiple sentences (i.e., a paragraph). Previous long-term motion generation approaches were mostly recurrent, using previously generated motion chunks as input for the next step. However, this approach has two drawbacks: 1) it relies on sequential datasets, which are expensive; 2) it yields unrealistic gaps between the motions generated at each step. To address these issues, we introduce T2LM, a simple yet effective continuous long-term generation framework that can be trained without sequential data. T2LM comprises two components: a 1D-convolutional VQ-VAE, trained to compress motion into sequences of latent vectors, and a Transformer-based Text Encoder that predicts a latent sequence given an input text. At inference, a sequence of sentences is translated into a continuous stream of latent vectors. This stream is then decoded into a motion by the VQ-VAE decoder; the use of 1D convolutions with a local temporal receptive field avoids temporal inconsistencies between training and generated sequences. This simple constraint on the VQ-VAE allows it to be trained with short sequences only and produces smoother transitions. T2LM outperforms prior long-term generation models while removing the need for sequential data; it is also competitive with SOTA single-action generation models.
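To make the role of the local temporal receptive field concrete, here is a minimal toy sketch in plain Python. It is not the authors' model. It assumes scalar latents, a single fixed smoothing kernel of size 3 in place of the learned decoder, and zero padding at the sequence ends; all names (`KERNEL`, `conv1d_same`, `decode_stream`, `decode_separately`) are hypothetical. It compares decoding a concatenated latent stream in one pass with decoding each segment separately.

```python
# Toy illustration only: scalar "latents" and one fixed 1D convolution
# stand in for the learned VQ-VAE decoder described in the abstract.

KERNEL = [0.25, 0.5, 0.25]  # hypothetical smoothing kernel, size 3
RADIUS = len(KERNEL) // 2   # receptive-field radius in latent steps


def conv1d_same(latents, kernel=KERNEL):
    """1D convolution with zero padding, so output length equals input length."""
    radius = len(kernel) // 2
    padded = [0.0] * radius + list(latents) + [0.0] * radius
    return [
        sum(k * padded[t + i] for i, k in enumerate(kernel))
        for t in range(len(latents))
    ]


def decode_stream(segments):
    """Concatenate per-sentence latent segments, then decode them in one pass."""
    stream = [z for segment in segments for z in segment]
    return conv1d_same(stream)


def decode_separately(segments):
    """Decode each segment on its own and concatenate the outputs (baseline)."""
    return [y for segment in segments for y in conv1d_same(segment)]


# Two hypothetical latent segments, e.g. one per input sentence.
walk = [1.0, 1.0, 1.0, 1.0]
jump = [5.0, 5.0, 5.0, 5.0]

joint = decode_stream([walk, jump])
split = decode_separately([walk, jump])

# Indices where the two decoding strategies disagree.
changed = [t for t, (a, b) in enumerate(zip(joint, split)) if abs(a - b) > 1e-9]
print(changed)  # [3, 4]
```

In this toy setting the two strategies differ only at the frames within `RADIUS` of the segment boundary (indices 3 and 4). Everywhere else, each output depends only on latents from its own segment, so the decoder sees the same local context it would see on a short sequence.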
