TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts

Chuan Guo, Xinxin Zuo, Sen Wang, Li Cheng

TL;DR

TM2T tackles bidirectional generation between 3D human motions and text by introducing motion tokens, a discrete, vector-quantized representation learned from data. The framework uses autoregressive neural machine translation (NMT) networks to map between motion tokens and text tokens, and it employs an inverse-alignment mechanism in which motion2text regularizes text2motion training. Because motion tokens are sampled autoregressively, text2motion generation is non-deterministic and produces motions of variable length. Evaluations on HumanML3D and KIT-ML, using quantitative metrics and human studies, show state-of-the-art performance for both text2motion and motion2text. Limitations include quantization artifacts and difficulty with long descriptions, which suggests joint training and improved codebook optimization as future work.

Abstract

Inspired by the strong ties between vision and language, two closely linked human sensing and communication modalities, our paper explores the generation of 3D human full-body motions from texts, as well as its reciprocal task, abbreviated as text2motion and motion2text, respectively. To tackle the existing challenges, especially to enable the generation of multiple distinct motions from the same text and to avoid the undesirable production of trivial motionless pose sequences, we propose the use of motion tokens, a discrete and compact motion representation. This provides a level playing field for motions and texts, which are represented as motion tokens and text tokens, respectively. Moreover, our motion2text module is integrated into the inverse alignment process of our text2motion training pipeline, where a significant deviation of the synthesized text from the input text is penalized by a large training loss; empirically, this is shown to effectively improve performance. Finally, the mappings between the two modalities of motions and texts are facilitated by adapting the neural model for machine translation (NMT) to our context. This autoregressive modeling of the distribution over discrete motion tokens further enables non-deterministic production of pose sequences, of variable lengths, from an input text. Our approach is flexible and can be used for both text2motion and motion2text tasks. Empirical evaluations on two benchmark datasets demonstrate the superior performance of our approach on both tasks over a variety of state-of-the-art methods. Project page: https://ericguo5513.github.io/TM2T/
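
The claim that autoregressive modeling over discrete motion tokens yields non-deterministic, variable-length outputs can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_next_token_probs` is a hypothetical stand-in for a trained text2motion network (a real model would condition on the input text and on the token prefix), and the vocabulary, probabilities, and `max_len` are made-up values. Only the [BOM] and [EOM] markers come from the paper.

```python
import random

BOM, EOM = "[BOM]", "[EOM]"  # start/end markers of a motion token sequence


def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a trained text2motion network.

    Returns a vocabulary and a categorical distribution over the next token.
    This toy version ignores the prefix and the text entirely.
    """
    vocab = [0, 1, 2, EOM]          # three toy codebook indices plus [EOM]
    probs = [0.3, 0.3, 0.2, 0.2]    # assumed values, for illustration only
    return vocab, probs


def sample_motion_tokens(next_token_probs, rng, max_len=50):
    """Sample tokens one at a time until [EOM] is drawn or max_len is reached."""
    tokens = [BOM]
    for _ in range(max_len):
        vocab, probs = next_token_probs(tokens)
        tok = rng.choices(vocab, weights=probs, k=1)[0]
        if tok == EOM:
            break
        tokens.append(tok)
    return tokens[1:]  # drop [BOM]; the rest are codebook indices


# Different random seeds stand in for repeated generation from the same text.
samples = [sample_motion_tokens(toy_next_token_probs, random.Random(seed))
           for seed in range(5)]
for s in samples:
    print(len(s), s)
```

Because each token is drawn from a distribution rather than chosen deterministically, repeated runs can differ both in content and in the step at which [EOM] appears, which is what gives sequences of different lengths.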

Paper Structure (29 sections, 7 equations, 9 figures, 3 tables)

This paper contains 29 sections, 7 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: An illustration of our bidirectional TM2T approach that captures the interplay between text (left) and 3D human motion (right) through the text2motion and motion2text modules. Note that the stochastic nature of our text2motion module allows the generation of different 3D motions from the same textual description.
  • Figure 2: Approach overview. (a) A 1D CNN based latent quantization model is first learned to reconstruct training motions. After training, a motion can be converted to a tuple of discrete motion tokens (i.e., codebook indices). [BOM] and [EOM] are start and end indicators added to a motion token sequence. (b-c) Mappings between motion and text tokens are modeled by autoregressive NMT networks and optimized by maximizing the log-likelihood of the targets ($\mathscr{L}_{NLL}$ and $\mathscr{L}_{NLL}^m$). (c) While training text2motion, motion tokens sampled from the resulting discrete distributions are inversely mapped to the text space via the learned motion2text model. Loss $\mathscr{L}_{NLL}^t$ penalizes the inverse alignment error. Finally, the 3D pose sequence is obtained by decoding motion tokens via the decoder $\mathrm{D}$ in (a).
  • Figure 3: Exemplar results of motion tokens (middle) and their corresponding pose sequences (top and bottom). Two 24-frame pose sequence examples are presented; each is reconstructed from a motion token sequence of length 6. Each motion token is associated with a specific local spatial-temporal context, visualized as a 4-frame motion.
  • Figure 4: Statistics of human preference among the generated descriptions for given human motions. For each method, a color bar (from blue to red) indicates the percentage at each preference level (from least to most preferred).
  • Figure 5: Examples of motion-to-text translation results from different approaches. Grammatical tense and plural forms of words are ignored to simplify the learning process. More results are provided in supplementary files.
  • ...and 4 more figures
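
The tokenization described in the captions of Figures 2 and 3 can be sketched as follows. This is a simplified illustration, not the paper's model: the nearest-entry lookup under squared Euclidean distance is the usual vector-quantization rule and is assumed here, the codebook and latent vectors are toy values, and the factor of 4 frames per token is inferred from the Figure 3 caption (24 frames reconstructed from 6 tokens). The function names are hypothetical.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    Assumes squared Euclidean distance, the usual choice in vector quantization.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


def wrap_with_markers(tokens):
    """Add the [BOM] and [EOM] indicators around a motion token sequence."""
    return ["[BOM]"] + list(tokens) + ["[EOM]"]


def decoded_frame_count(num_tokens, frames_per_token=4):
    """Number of pose frames covered by a token sequence.

    frames_per_token=4 is inferred from Figure 3 (6 tokens for 24 frames).
    """
    return num_tokens * frames_per_token


# Toy codebook with three 2-D entries; a real codebook is learned from data.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

tokens = quantize(latents, codebook)
print(tokens)                      # codebook indices, one per latent vector
print(wrap_with_markers(tokens))   # sequence as fed to the NMT networks
print(decoded_frame_count(6))      # frames covered by a 6-token sequence
```

In the paper, the latent vectors come from a 1D CNN encoder and the decoder $\mathrm{D}$ maps tokens back to poses; both are omitted here, so the sketch covers only the discrete lookup and the length bookkeeping.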