T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences

Taeryung Lee, Fabien Baradel, Thomas Lucas, Kyoung Mu Lee, Gregory Rogez

TL;DR

This work tackles long-term 3D human motion generation from streams of natural language descriptions. It introduces T2LM, a non-recurrent framework that combines a 1D-convolutional VQVAE for discrete motion latent representations with a Transformer-based Text Encoder to map text to these latents, enabling training without sequential data. By decoding a continuous stream of latent vectors through a 1D convolutional decoder, T2LM achieves smooth transitions across actions and scales to long sequences. Empirical results on HumanML3D and BABEL show state-of-the-art performance for long-term generation and competitive results for single-action tasks, highlighting the practicality of continuous latent-based generation for text-conditioned motion synthesis.

Abstract

In this paper, we address the challenging problem of long-term 3D human motion generation. Specifically, we aim to generate a long sequence of smoothly connected actions from a stream of multiple sentences (i.e., a paragraph). Previous long-term motion generation approaches were mostly based on recurrent methods, using previously generated motion chunks as input for the next step. However, this approach has two drawbacks: 1) it relies on sequential datasets, which are expensive; 2) it yields unrealistic gaps between the motions generated at each step. To address these issues, we introduce T2LM, a simple yet effective continuous long-term generation framework that can be trained without sequential data. T2LM comprises two components: a 1D-convolutional VQVAE, trained to compress motion into sequences of latent vectors, and a Transformer-based Text Encoder that predicts a latent sequence given an input text. At inference, a sequence of sentences is translated into a continuous stream of latent vectors. This is then decoded into a motion by the VQVAE decoder; the use of 1D convolutions with a local temporal receptive field avoids temporal inconsistencies between training and generated sequences. This simple constraint on the VQVAE allows it to be trained with short sequences only and produces smoother transitions. T2LM outperforms prior long-term generation models while overcoming the constraint of requiring sequential data; it is also competitive with SOTA single-action generation models.
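
The locality argument in the abstract can be illustrated with a toy sketch. It is not the authors' decoder: the single-channel latents, the two random size-3 kernels, the `tanh` nonlinearity, and the names `conv1d_same` and `decode` are all assumptions made for illustration. The sketch decodes two latent segments separately and as one concatenated stream, then checks that the outputs agree everywhere except within the receptive-field radius of the junction, which is the only region where the decoder has to blend the two actions.

```python
import numpy as np

def conv1d_same(x, kernel):
    """Zero-padded 1D convolution over time; x has shape (T,), kernel has odd length."""
    pad = len(kernel) // 2
    padded = np.pad(x, pad)
    return np.array([padded[t:t + len(kernel)] @ kernel for t in range(len(x))])

def decode(latents, kernels):
    """Toy stand-in for a convolutional decoder (assumption, not the paper's network)."""
    h = latents
    for k in kernels:
        h = np.tanh(conv1d_same(h, k))
    return h

rng = np.random.default_rng(0)
kernels = [rng.normal(size=3), rng.normal(size=3)]
# Each size-3 layer looks one step to either side, so two layers give radius 2.
radius = sum(len(k) // 2 for k in kernels)

latents_a = rng.normal(size=16)  # hypothetical latents predicted for sentence A
latents_b = rng.normal(size=16)  # hypothetical latents predicted for sentence B
stream = np.concatenate([latents_a, latents_b])

joint = decode(stream, kernels)
alone_a = decode(latents_a, kernels)
alone_b = decode(latents_b, kernels)

# Away from the junction, decoding the stream matches decoding each segment alone.
same_a = np.allclose(joint[:16 - radius], alone_a[:16 - radius])
same_b = np.allclose(joint[16 + radius:], alone_b[radius:])
print(same_a, same_b)
```

Because each output frame only sees a fixed window of latents, a decoder of this kind never encounters context at inference that is longer than what it saw on short training clips, which is the property the abstract relies on.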

Paper Structure (16 sections, 6 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Visual result. We present a qualitative example obtained from our long-term motion generator. A stream of input texts is used to condition our model and produce a matching continuous motion.
  • Figure 2: Overview of T2LM. We present the overview of our test-time generation. From the stream of textual descriptions and desired lengths of each action, we produce a smooth long-term motion corresponding to the text stream.
  • Figure 2: VQVAE architecture. We present the architecture of our VQVAE. Both the encoder and the decoder are built with convolutional layers.
  • Figure 3: Text Encoder architecture. We present the architecture of the Text Encoder. A first text encoder injects information about the text and length embeddings into a sequence of tokens, and a second autoregressive model predicts the latent sequence.
  • Figure 4: Qualitative result. We provide visualizations of generated long-term motions obtained with our method. The first, second, and third actions are rendered in blue, purple, and brown, respectively. This is a video figure that is best viewed in Adobe Reader.
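
For readers unfamiliar with the discrete latents a VQVAE produces, the following is a minimal sketch of the standard nearest-codebook lookup used by VQ-VAE models in general. The codebook values, feature values, and the function name `quantize` are hypothetical; the summary above does not specify the codebook size, latent dimension, or any variant the paper may use.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    features: (T, D) encoder outputs, one per latent time step.
    codebook: (K, D) learned code vectors.
    Returns the (T,) code indices and the (T, D) quantized vectors.
    """
    # Squared Euclidean distance between every feature and every code: (T, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

# Hypothetical 3-entry codebook and 4 encoder outputs in 2 dimensions.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
features = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.6, 0.6]])

indices, quantized = quantize(features, codebook)
print(indices)
```

The index sequence is the discrete representation that a text-conditioned model can be trained to predict, and the quantized vectors are what a decoder would consume.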