MoMask: Generative Masked Modeling of 3D Human Motions

Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, Li Cheng

TL;DR

MoMask tackles text-to-motion generation with a masked generative framework that uses hierarchical residual quantization to produce multi-layer motion tokens. A base-layer Masked Transformer generates the core tokens from text, and a Residual Transformer then progressively adds higher-order residual tokens, enabling high-fidelity, efficient synthesis in a limited number of iterations. Empirical results on HumanML3D and KIT-ML show state-of-the-art FID and strong text-motion alignment, and the model also supports temporal inpainting without further fine-tuning. The approach offers a scalable, controllable pathway for text-driven 3D motion generation, with practical impact for games, metaverse, and VR/AR applications.

Abstract

We introduce MoMask, a novel masked modeling framework for text-driven 3D human motion generation. In MoMask, a hierarchical quantization scheme is employed to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting at the base layer, with a sequence of motion tokens obtained by vector quantization, the residual tokens of increasing order are derived and stored at the subsequent layers of the hierarchy. These tokens are then modeled by two distinct bidirectional transformers. For the base-layer motion tokens, a Masked Transformer is trained to predict randomly masked motion tokens conditioned on the text input. At the generation (i.e., inference) stage, starting from an empty sequence, our Masked Transformer iteratively fills in the missing tokens. Subsequently, a Residual Transformer learns to progressively predict the next-layer tokens based on the results from the current layer. Extensive experiments demonstrate that MoMask outperforms state-of-the-art methods on the text-to-motion generation task, with an FID of 0.045 (vs. e.g. 0.141 for T2M-GPT) on the HumanML3D dataset and 0.228 (vs. 0.514) on KIT-ML. MoMask can also be applied seamlessly to related tasks, such as text-guided temporal inpainting, without further model fine-tuning.
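
The hierarchical quantization described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `residual_quantize`, the toy codebooks, and the nearest-neighbour lookup by squared Euclidean distance are all assumptions made for illustration. It only shows the general idea that the base layer quantizes the latent vector and each later layer quantizes whatever residual the previous layers left behind.

```python
import numpy as np


def residual_quantize(latents, codebooks):
    """Quantize `latents` (n, d) with one codebook (K, d) per layer.

    Hypothetical helper for illustration. Returns the token ids per layer,
    shape (num_layers, n), and the reconstruction obtained by summing the
    selected codes over all layers.
    """
    residual = latents.astype(float)  # astype returns a copy
    recon = np.zeros_like(residual)
    tokens = []
    for codebook in codebooks:
        # Squared Euclidean distance from every residual vector to every code.
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)        # nearest code per position
        quantized = codebook[idx]
        tokens.append(idx)
        recon += quantized                # layers add up to the approximation
        residual = residual - quantized   # what the next layer must explain
    return np.stack(tokens), recon


# Toy data (assumed): one 2-D latent, a base codebook, one residual codebook.
latents = np.array([[0.9, 0.8]])
codebooks = [
    np.array([[1.0, 1.0], [0.0, 0.0]]),    # base layer
    np.array([[0.5, 0.5], [-0.1, -0.2]]),  # first residual layer
]
tokens, recon = residual_quantize(latents, codebooks)
```

In this toy case the base layer picks code 0 and the residual layer picks code 1, so the two layers together recover the input up to floating-point error. In a trained model the codebooks are learned, and the reconstruction is only approximate.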

Paper Structure (12 sections, 5 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Our MoMask, when provided with a text input, generates high-quality 3D human motion with diversity and precise control over subtleties such as "two strides forward", "pivot on left foot", and "pivot swiftly".
  • Figure 2: Approach overview. (a) Motion sequence is tokenized through vector quantization (VQ), also referred to as the base quantization layer, as well as a hierarchy of multiple layers for residual quantization. (b) Parallel prediction by the Masked Transformer: the tokens in the base layer $t^0$ are randomly masked out with a variable rate, and then a text-conditioned masked transformer is trained to predict the masked tokens in the sequence simultaneously. (c) Layer-by-layer progressive prediction by the Residual Transformer. A text-conditioned residual transformer learns to progressively predict the residual tokens $t^{j>0}$ from the tokens in previous layers, $t^{0:j-1}$.
  • Figure 3: Inference process. Starting from an empty sequence $t^0(0)$, the M-Transformer generates the base-layer token sequence $t^0$ in $L$ iterations. Following this, the R-Transformer progressively predicts the remaining-layer token sequences $t^{1:V}$ within $V$ steps.
  • Figure 4: Visual comparisons between different methods given three distinct text descriptions from the HumanML3D test set. Only key frames are displayed. Compared to previous methods, MoMask generates motions with higher quality and a better understanding of subtle language concepts such as "stumble", "sneak", and "walk sideways". Please refer to the demo video for complete motion clips.
  • Figure 5: (a) Comparison of inference time costs. All tests are conducted on the same Nvidia 2080Ti. The closer a model is to the origin, the better. (b) User study results on the HumanML3D dataset. Each bar represents the preference rate of MoMask over the compared model. Overall, MoMask is preferred over the other models most of the time. The dashed line marks 50%.
  • ...and 2 more figures
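
The first stage of the inference process in Figure 3, in which the base-layer sequence is filled in from an empty (fully masked) sequence over a fixed number of iterations, can be sketched as follows. This is a rough illustration, not the paper's code. The cosine schedule and the rule of committing the highest-confidence predictions first are assumptions borrowed from MaskGIT-style decoding. `toy_predictor` is a hypothetical stand-in that returns random probabilities in place of a text-conditioned transformer, and `MASK`, `seq_len`, `vocab_size` and `num_iters` are illustrative names.

```python
import math

import numpy as np

MASK = -1  # hypothetical id marking a position that is still empty


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a text-conditioned model: random per-position probabilities."""
    logits = rng.normal(size=(len(tokens), vocab_size))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)


def iterative_masked_decode(seq_len, vocab_size, num_iters, seed=0):
    """Fill a fully masked sequence in at most `num_iters` parallel steps."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    for step in range(1, num_iters + 1):
        masked = tokens == MASK
        if not masked.any():
            break
        probs = toy_predictor(tokens, vocab_size, rng)
        pred = probs.argmax(axis=1)   # most likely token at each position
        conf = probs.max(axis=1)      # confidence of that prediction
        # Assumed cosine schedule: how many positions stay masked after this
        # step. It reaches zero at the last step, so every position is filled.
        n_remain = math.floor(seq_len * math.cos(math.pi / 2 * step / num_iters))
        n_fill = int(masked.sum()) - n_remain
        if n_fill <= 0:
            continue
        # Commit the most confident predictions among still-masked positions.
        conf_masked = np.where(masked, conf, -np.inf)
        chosen = np.argsort(-conf_masked)[:n_fill]
        tokens[chosen] = pred[chosen]
    return tokens


base_tokens = iterative_masked_decode(seq_len=12, vocab_size=8, num_iters=5)
```

The second stage in Figure 3 would then run a separate predictor once per residual layer, each time conditioning on the layers already generated. It is omitted here to keep the sketch short.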