Next-Scale Autoregressive Models for Text-to-Motion Generation

Zhiwei Zheng, Shibo Jin, Lingjie Liu, Mingmin Zhao

Abstract

Autoregressive (AR) models offer stable and efficient training, but standard next-token prediction is poorly aligned with the temporal structure required for text-conditioned motion generation. We introduce MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions. By establishing global semantics at the coarsest scale and refining them progressively, MoScale builds a causal hierarchy better suited to long-range motion structure. To improve robustness under limited text-motion data, we further incorporate cross-scale hierarchical refinement, which improves each scale's initial predictions, and in-scale temporal refinement, which selectively re-predicts tokens bidirectionally. MoScale achieves state-of-the-art text-to-motion performance with high training efficiency, scales effectively with model size, and generalizes zero-shot to diverse motion generation and editing tasks.
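
Concretely, the coarse-to-fine loop the abstract describes might look like the following minimal sketch. The scale schedule and the model methods (`predict_scale`, `refine_in_scale`, `decoder`) are hypothetical placeholders standing in for the paper's components, not the authors' API.

```python
import torch

# Illustrative scale schedule (tokens per temporal scale); the paper's
# actual schedule is not specified here.
SCALES = [1, 2, 4, 8, 16, 32]

@torch.no_grad()
def generate_motion(model, text_emb, scales=SCALES):
    """Coarse-to-fine next-scale autoregressive generation (sketch).

    At each scale the model predicts all of that scale's tokens in one
    step, conditioned on the text embedding and on every coarser scale
    already generated (the prefix); finer scales refine coarser ones.
    """
    prefix = []  # tokens from all previously generated (coarser) scales
    for num_tokens in scales:
        # Predict logits for every position at this scale at once,
        # attending causally to coarser scales (hierarchical scale-wise
        # causal attention).
        logits = model.predict_scale(text_emb, prefix, num_tokens)
        tokens = logits.argmax(dim=-1)  # greedy; sampling also works
        # In-scale temporal refinement: selectively re-predict tokens
        # bidirectionally within the current scale.
        tokens = model.refine_in_scale(text_emb, prefix, tokens)
        prefix.append(tokens)
    # Decode the multi-scale token hierarchy back into a motion sequence.
    return model.decoder(prefix)
```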

Figures (6)

  • Figure 1: MoScale accurately captures global semantic structure in text descriptions, such as "two jumping jacks" and sequential actions including "turn around, pick things up, and turn around", where prior methods fail to align with the text. Our next-scale autoregressive design with hierarchical causality enables MoScale to preserve these long-range semantics while maintaining realistic motion.
  • Figure 2: Overview of MoScale. (a) MoScale encodes motion sequences into discrete tokens from coarse to fine through multi-scale quantization (a sketch of this step follows the figure list). (b) It autoregressively predicts tokens at the next scale, conditioned on the prefix and text inputs, using hierarchical scale-wise causal attention. (c) Within each scale, MoScale performs temporal refinement to further improve token quality and consistency.
  • Figure 3: Comparison of Top-1 text alignment and training time on HumanML3D [guo2022generating]. We evaluate four model sizes (tiny, small, medium, and large) for our method and T2M-GPT [zhang2023generating].
  • Figure 4: Motion editing results. MoScale achieves better instruction adherence and retains unedited motion (shown in gray).
  • Figure 5: Iteration study.
  • ...and 1 more figure
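
The multi-scale tokenization in Figure 2(a) can be pictured as a VAR-style residual quantizer applied over temporal scales. The sketch below illustrates the general shape under stated assumptions (a single shared codebook, linear up/downsampling, an illustrative scale schedule); it is not MoScale's actual quantizer.

```python
import torch
import torch.nn.functional as F

def multi_scale_quantize(latent, codebook, scales=(1, 2, 4, 8, 16)):
    """Coarse-to-fine residual quantization over temporal scales (sketch).

    latent:   (T, D) continuous motion latents from an encoder.
    codebook: (K, D) shared codebook.
    Returns one token map per scale; each scale quantizes whatever
    residual remains after reconstructing from all coarser scales.
    """
    T, _ = latent.shape
    residual, token_maps = latent, []
    for t in scales:
        # Downsample the residual to this scale's temporal resolution.
        coarse = F.interpolate(residual.T[None], size=t, mode="linear")[0].T  # (t, D)
        # Nearest-codebook assignment gives this scale's discrete tokens.
        idx = torch.cdist(coarse, codebook).argmin(dim=-1)  # (t,)
        token_maps.append(idx)
        # Upsample the quantized approximation and subtract it, so the
        # next (finer) scale only encodes what is still missing.
        up = F.interpolate(codebook[idx].T[None], size=T, mode="linear")[0].T  # (T, D)
        residual = residual - up
    return token_maps
```

Quantizing residuals rather than raw latents is what lets each finer scale add detail on top of the coarser scales instead of re-encoding the whole sequence.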