ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model

Shunlin Lu, Jingbo Wang, Zeyu Lu, Ling-Hao Chen, Wenxun Dai, Junting Dong, Zhiyang Dou, Bo Dai, Ruimao Zhang

TL;DR

This work introduces a scalable motion generation framework, consisting of the motion tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer, and confirms for the first time that scaling laws hold in motion generation.

Abstract

Scaling laws have been validated in various domains, such as natural language processing (NLP) and large-scale computer vision tasks; however, their application to motion generation remains largely unexplored. In this paper, we introduce a scalable motion generation framework that consists of the motion tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer. Through comprehensive experiments, we observe the scaling behavior of this system and, for the first time, confirm the existence of scaling laws in motion generation. Specifically, our results demonstrate that the normalized test loss of our prefix autoregressive models follows a logarithmic law with respect to the compute budget. We also confirm that Non-Vocabulary Parameters, Vocabulary Parameters, and Data Tokens each follow a power law with respect to the compute budget. Leveraging these scaling laws, we predict the optimal transformer size, vocabulary size, and data requirements for a compute budget of $1 \times 10^{18}$ FLOPs. When the system is trained with the predicted optimal model size, vocabulary size, and amount of data, its test loss aligns precisely with the predicted test loss, thereby validating the scaling law.
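To make the two functional forms in the abstract concrete, the following is a minimal sketch of how a logarithmic law (loss versus compute) and a power law (a parameter count versus compute) could be fitted and then extrapolated to a larger budget. The parameterizations `loss = a * ln(C) + b` and `N = k * C**alpha`, the helper names, and all data points are assumptions made for illustration; they are not the paper's fitted coefficients or its actual fitting procedure.

```python
import numpy as np

def fit_log_law(compute, loss):
    """Least-squares fit of loss = a * ln(compute) + b (assumed form)."""
    a, b = np.polyfit(np.log(compute), loss, deg=1)
    return a, b

def fit_power_law(compute, quantity):
    """Fit quantity = k * compute**alpha as a straight line in log-log space."""
    alpha, log_k = np.polyfit(np.log(compute), np.log(quantity), deg=1)
    return np.exp(log_k), alpha

# Synthetic points generated from known laws (NOT measurements from the paper).
compute = np.array([1e15, 1e16, 1e17])   # compute budgets in FLOPs
loss = -0.05 * np.log(compute) + 3.0     # hypothetical normalized test loss
params = 2.0 * compute ** 0.5            # hypothetical parameter counts

a, b = fit_log_law(compute, loss)
k, alpha = fit_power_law(compute, params)

# Extrapolate both fits to a larger, unseen compute budget.
target = 1e18
predicted_loss = a * np.log(target) + b
predicted_params = k * target ** alpha
```

Because the synthetic data are noise-free, the fits recover the generating coefficients; with real measurements the quality of the extrapolation would depend on how well the assumed forms describe the data.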

Paper Structure

This paper contains 23 sections, 15 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Generation results of ScaMo-3B for a text input. Our model can handle both abstract sentences and long sentences.
  • Figure 2: Normalized test loss plotted against FLOPs to show the scaling behavior. Overall, larger models and larger vocabulary sizes achieve better performance.
  • Figure 3: Scaling laws of ScaMo. (a) Power law between $N_{nv}$ and $N_v$. We can predict $N_v$ precisely for a given $N_{nv}$. (b) Logarithmic law between FLOPs $C$ and normalized test loss $\mathcal{L}_u$. We can predict $\mathcal{L}_u$ precisely for a given FLOPs budget $C$.
  • Figure 4: Frame statistics of the MotionUnion dataset. Motion capture data accounts for the majority.
  • Figure 5: Overview of the ScaMo architecture. (a) FSQ: Motion FSQ-VAE. We use one code quantization and $d=L=3$ as an example. The features of other frames are quantized in the same way. (b) (c) Text-prefix Autoregressive Transformer: bidirectional attention is applied to the text tokens and causal attention to the motion tokens. Motion tokens can attend to all text tokens.
  • ...and 5 more figures
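
The attention pattern described in the Figure 5 caption can be sketched as a boolean mask. This is an illustrative construction only: the function name `prefix_attention_mask`, the sequence layout (text tokens first, then motion tokens), and the choice that text tokens do not attend to motion tokens are assumptions, not details confirmed by the source.

```python
import numpy as np

def prefix_attention_mask(num_text, num_motion):
    """Build a mask where mask[i, j] is True if position i may attend to position j.

    Assumed layout: [text tokens..., motion tokens...].
    """
    total = num_text + num_motion
    # Start from a causal (lower-triangular) mask over the whole sequence.
    # Every motion token follows the text prefix, so it already sees all text.
    mask = np.tril(np.ones((total, total), dtype=bool))
    # Make the text prefix fully bidirectional.
    mask[:num_text, :num_text] = True
    return mask

# Example: 2 text tokens followed by 3 motion tokens.
mask = prefix_attention_mask(num_text=2, num_motion=3)
print(mask.astype(int))
```

In this sketch the upper-right block (text rows, motion columns) stays False, so the text prefix is encoded independently of the motion tokens generated after it.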