MoSa: Motion Generation with Scalable Autoregressive Modeling

Mengyuan Liu, Sheng Yan, Yong Wang, Yingjie Li, Gui-Bin Bian, Hong Liu

TL;DR

MoSa addresses inefficiency and cross-layer misalignment in Vector Quantization-guided Generative Transformer (VQ-GT) motion generation. It introduces MTPS to preserve multi-scale tokens within a hierarchical RQ-VAE, which enables Scalable Autoregressive (SAR) modeling and reduces the number of inference steps to the number of quantization layers (e.g., $Q=10$). The framework combines MTPS, SAR, and CAQ-VAE (a convolution-attention hybrid VQ-VAE) to achieve coherent coarse-to-fine motion generation with improved reconstruction and speed. On HumanML3D and Motion-X it outperforms baselines in FID and semantic fidelity, and on Motion-X it reduces inference time by 27% relative to MoMask. Ablations indicate that the CAQ-VAE components, scale-wise codebooks, and cross-scale attention each contribute to generation quality, and MoSa extends to motion editing without additional fine-tuning.

Abstract

We introduce MoSa, a novel hierarchical motion generation framework for text-driven 3D human motion generation that enhances the Vector Quantization-guided Generative Transformers (VQ-GT) paradigm through a coarse-to-fine scalable generation process. In MoSa, we propose a Multi-scale Token Preservation Strategy (MTPS) integrated into a hierarchical residual vector quantization variational autoencoder (RQ-VAE). MTPS employs interpolation at each hierarchical quantization to effectively retain coarse-to-fine multi-scale tokens. With this, the generative transformer supports Scalable Autoregressive (SAR) modeling, which predicts scale tokens, unlike traditional methods that predict only one token at each step. Consequently, MoSa requires only 10 inference steps, matching the number of RQ-VAE quantization layers. To address potential reconstruction degradation from frequent interpolation, we propose CAQ-VAE, a lightweight yet expressive convolution-attention hybrid VQ-VAE. CAQ-VAE enhances residual block design and incorporates attention mechanisms to better capture global dependencies. Extensive experiments show that MoSa achieves state-of-the-art generation quality and efficiency, outperforming prior methods in both fidelity and speed. On the Motion-X dataset, MoSa achieves an FID of 0.06 (versus MoMask's 0.20) while reducing inference time by 27 percent. Moreover, MoSa generalizes well to downstream tasks such as motion editing, requiring no additional fine-tuning. The code is available at https://mosa-web.github.io/MoSa-web
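The abstract describes MTPS as interpolation applied at each residual quantization level. The sketch below is one plausible reading of that idea, not the authors' implementation: at each scale the current residual is downsampled to $s_q$ samples, quantized, upsampled back to the full length $T$, and subtracted. The function names, the 1-D scalar latent, the linear interpolation, and the single shared codebook are all simplifying assumptions made for illustration.

```python
def resample(seq, new_len):
    """Linearly interpolate a 1-D list of floats to new_len samples."""
    old_len = len(seq)
    if new_len == 1:
        return [sum(seq) / old_len]          # collapse to the mean
    if old_len == 1:
        return [seq[0]] * new_len            # broadcast a single sample
    out = []
    for i in range(new_len):
        pos = i * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        out.append(seq[lo] * (1 - frac) + seq[hi] * frac)
    return out


def quantize(seq, codebook):
    """Index of the nearest codeword for every sample."""
    return [min(range(len(codebook)), key=lambda k: abs(codebook[k] - v))
            for v in seq]


def multiscale_residual_quantize(latent, scales, codebook):
    """Hypothetical MTPS-style loop: one token list per scale, coarse to fine."""
    T = len(latent)
    residual = list(latent)
    recon = [0.0] * T
    tokens = []
    for s in scales:                                   # s_q <= T
        coarse = resample(residual, s)                 # downsample residual
        idx = quantize(coarse, codebook)               # tokens x^(q)
        tokens.append(idx)
        up = resample([codebook[k] for k in idx], T)   # upsample back to T
        recon = [r + u for r, u in zip(recon, up)]
        residual = [x - u for x, u in zip(residual, up)]
    return tokens, recon


latent = [0.0, 0.4, 0.9, 1.0, 0.7, 0.2, -0.3, -0.8, -1.0, -0.6]
codebook = [-1.0, -0.5, 0.0, 0.5, 1.0]
tokens, recon = multiscale_residual_quantize(latent, (3, 6, 10), codebook)
print([len(t) for t in tokens])  # [3, 6, 10]
```

Each scale keeps its own short token list, so the token set grows from coarse to fine while the reconstruction is always accumulated at full length.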

Paper Structure

This paper contains 17 sections, 9 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Comparison between the state-of-the-art method MoMask (guo2024momask) and our MoSa in the VQ-GT processes: (a) MoMask’s VQ. (b) Our VQ maintains a multi-scale token set via our proposed MTPS, which employs interpolation (downsampling/upsampling) at each hierarchical quantization. (c) MoMask’s GT process relies on two independent transformers, leading to cross-layer misalignment. (d) Our GT process with a scalable autoregressive transformer shows cross-layer alignment.
  • Figure 2: Our MoSa framework overview. (a) Multi-scale Token Preservation Strategy (MTPS) integrated into a hierarchical RQ-VAE. MTPS employs interpolation (downsampling/upsampling) at each hierarchical quantization to effectively retain the coarse-to-fine multi-scale token set $X$. The scales follow a predefined scheduler $S = (s_{1}, s_{2}, s_{3}, \dots, s_{Q})$, where $s_{q} \leq T$, representing a coarse-to-fine hierarchy. The illustration shows an example with $(s_{1}=3, s_{2}=6, s_{3}=10)$. (b) The multi-scale token set supervises Scalable Autoregressive (SAR) modeling. Given an input $(\texttt{[sos]}, x^{(1)}, x^{(2)}, \dots, x^{(Q-1)})$, the SAR predicts $(x^{(1)}, x^{(2)}, \dots, x^{(Q)})$, where the multiple tokens within each scale are predicted in parallel. During training, a scale-wise attention mask ensures that each $x^{(q)}$ can only attend to $x^{(\leq q)}$. Notably, $x^{(q)}$ contains $s_{q}$ tokens, while $x^{(q-1)}$ has only $s_{q-1}$ tokens. Before $x^{(q-1)}$ is fed into the Transformer, it is up-downsampled to match $s_{q}$. As illustrated, the input representation of $x^{(2)}$ is derived from up-downsampling $x^{(1)}$, and that of $x^{(3)}$ from $x^{(2)}$.
  • Figure 3: Previous VQ-VAE compared to our CAQ-VAE. Our CAQ-VAE uses residual blocks with GroupNorm and SiLU, along with a self-attention layer to capture global dependencies.
  • Figure 4: Impact of multi-scale token set size on HumanML3D. Using the MoSa-mini, we trained both the VQ Model (Reconstruction task) and the Transformer for text-to-motion synthesis (Generation task) on the HumanML3D dataset. The x-axis represents the size of the multi-scale token set $Q$, which also determines the total inference steps (ranging from 6 to 15). The results indicate that $Q=10$ achieves the best overall balance across all metrics.
  • Figure 5: Qualitative evaluation on Motion-X dataset. Motions that align with key semantics are highlighted in yellow. For more dynamic visualizations, please refer to the project page.
  • ...and 3 more figures
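
The scale-wise attention mask described in the Figure 2 caption can be sketched as a block-causal mask: a token at scale $q$ may attend to every token at scales $\leq q$, including the other tokens of its own scale. The function name and the boolean list-of-lists representation below are illustrative assumptions, not taken from the paper's code.

```python
def scale_wise_mask(scales):
    """mask[i][j] is True when query token i may attend to key token j.

    Tokens are laid out scale by scale, so scale q occupies s_q
    consecutive positions in the flattened sequence.
    """
    # Scale index of every position in the flattened sequence.
    scale_of = [q for q, s in enumerate(scales) for _ in range(s)]
    n = len(scale_of)
    return [[scale_of[j] <= scale_of[i] for j in range(n)] for i in range(n)]


scales = (3, 6, 10)  # the example scheduler from the Figure 2 caption
mask = scale_wise_mask(scales)
print(len(mask), len(scales))  # prints "19 3"
```

With this example scheduler the flattened sequence holds 19 tokens, but because all tokens of a scale are predicted in parallel, generation takes only 3 steps, one per scale. This is the sense in which the number of inference steps equals $Q$ rather than the total token count.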