Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation
Dongjie Fu
TL;DR
Mogo introduces a GPT-type architecture that delivers high-quality text-to-3D motion generation with streaming output and strong generalization. It fuses RVQ-VAE-based motion tokenization with a single Hierarchical Causal Transformer to jointly generate base motions and layer-wise residuals, enabling continuous sequences up to 13 seconds and real-time inference. Across HumanML3D, KIT-ML, and CMP, Mogo achieves state-of-the-art results among GPT-type models and competitive performance against BERT-type approaches, including zero-shot CMP evaluation and a user-preference advantage. The approach demonstrates practical impact for open-vocabulary prompts and interactive media applications, supported by prompt engineering and rigorous ablations on codebook design and transformer architecture.
Abstract
In text-to-motion generation, BERT-type masked models (MoMask, MMM) currently produce higher-quality outputs than GPT-type autoregressive models (T2M-GPT). However, these BERT-type models often lack the streaming output capability required for video game and multimedia applications, a feature inherent to GPT-type models. They also perform worse in out-of-distribution generation. To surpass the quality of BERT-type models while retaining a GPT-type structure, and without adding extra refinement models that complicate data scaling, we propose a novel architecture, Mogo (Motion Only Generate Once), which generates high-quality, lifelike 3D human motions by training a single transformer model. Mogo consists of only two main components: 1) RVQ-VAE, a hierarchical residual vector quantization variational autoencoder, which discretizes continuous motion sequences with high precision; and 2) a Hierarchical Causal Transformer, which generates the base motion sequences autoregressively while simultaneously inferring residuals across the different layers. Experimental results demonstrate that Mogo can generate continuous and cyclic motion sequences of up to 260 frames (13 seconds), surpassing the 196-frame (about 10 seconds) length limit of existing datasets such as HumanML3D. On the HumanML3D test set, Mogo achieves an FID score of 0.079, outperforming the GPT-type model T2M-GPT (FID = 0.116), AttT2M (FID = 0.112), and the BERT-type model MMM (FID = 0.080). Furthermore, our model achieves the best quantitative performance in out-of-distribution generation.
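To make the idea of residual vector quantization concrete, the following is a minimal, illustrative sketch in plain Python and is not Mogo's implementation. It assumes toy random codebooks rather than learned ones, omits the encoder and decoder networks of the RVQ-VAE, and uses hypothetical names throughout (`rvq_encode`, `rvq_decode`, `DIM`, `CODES`, `LAYERS`). It only shows how a base token plus layer-wise residual tokens can jointly approximate one latent vector.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the residual left so far."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the selected code of each layer; fewer tokens give a coarser result."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def sq_error(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy setup (assumption): random codebooks whose scale shrinks with depth.
# Index 0 of every codebook is a zero vector, a convenience of this sketch
# that guarantees an extra layer never increases the reconstruction error.
rng = random.Random(0)
DIM, CODES, LAYERS = 4, 8, 3
codebooks = []
for layer in range(LAYERS):
    scale = 1.0 / (2 ** layer)
    codes = [[0.0] * DIM]
    codes += [[rng.gauss(0.0, scale) for _ in range(DIM)] for _ in range(CODES - 1)]
    codebooks.append(codes)

latent = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
tokens = rvq_encode(latent, codebooks)

# Reconstruction error using the first k layers, for k = 0..LAYERS.
errors = [
    sq_error(latent, rvq_decode(tokens[:k], codebooks)) for k in range(LAYERS + 1)
]
print("tokens per layer:", tokens)
print("squared error by number of layers:", errors)
```

In this sketch the first token plays the role of the base motion code and the later tokens play the role of the residual codes. In Mogo, these would be predicted by the Hierarchical Causal Transformer rather than computed from a known latent.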
