Scaling Large Motion Models with Million-Level Human Motions

Ye Wang, Sipeng Zheng, Bin Cao, Qianshan Wei, Weishuai Zeng, Qin Jin, Zongqing Lu

TL;DR

The paper tackles data scarcity in motion generation by building MotionLib, a million-level dataset with hierarchical text; proposes MotionBook encoding (2D-LFQ, SMPL-D135) to improve representation and enable large-scale training; introduces Being-M0, a decoder-only large motion model trained with MotionLib and instruction tuning; shows scaling data and model size yields better performance and generalization, achieving SOTA on HumanML3D and strong OOD results; discusses evaluation limitations and potential directions for more robust benchmarks.

Abstract

Inspired by the recent success of LLMs, the field of human motion understanding has increasingly shifted toward developing large motion models. Despite some progress, current efforts remain far from achieving truly generalist models, primarily due to the lack of massive high-quality data. To address this gap, we present MotionLib, the first million-level dataset for motion generation, which is at least 15$\times$ larger than existing counterparts and enriched with hierarchical text descriptions. Using MotionLib, we train a large motion model named Being-M0, demonstrating robust performance across a wide range of human activities, including unseen ones. Through systematic investigation, for the first time, we highlight the importance of scaling both data and model size for advancing motion generation, along with key insights to achieve this goal. To better integrate the motion modality, we propose MotionBook, an innovative motion encoding approach comprising (1) a compact yet lossless feature to represent motions and (2) a novel 2D lookup-free motion tokenizer that preserves fine-grained motion details while expanding codebook capacity, significantly enhancing the representational power of motion tokens. We believe this work lays the groundwork for developing more versatile and powerful motion generation models in the future. For further details, visit https://beingbeyond.github.io/Being-M0/.
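
The abstract does not spell out how the 2D lookup-free tokenizer works. As a rough illustration of the general lookup-free quantization idea only, the sketch below binarizes each latent dimension by its sign, so the token index is read directly from the resulting bits and the implicit codebook has 2^d entries without any embedding lookup. The function names, the flat 1D latent vector, and the zero threshold are assumptions made for illustration; this is not the paper's 2D-LFQ implementation.

```python
from typing import List, Tuple


def lfq_quantize(latent: List[float]) -> Tuple[List[int], int]:
    """Sign-quantize a latent vector (hypothetical helper, not from the paper).

    Each dimension becomes +1 if positive, otherwise -1. The token index is
    the integer whose i-th bit is set when dimension i is +1.
    """
    codes = [1 if x > 0 else -1 for x in latent]
    index = sum(1 << i for i, c in enumerate(codes) if c > 0)
    return codes, index


def lfq_decode(index: int, dim: int) -> List[int]:
    """Recover the +1/-1 code vector from a token index (hypothetical helper)."""
    return [1 if (index >> i) & 1 else -1 for i in range(dim)]


def implicit_codebook_size(dim: int) -> int:
    """Number of distinct tokens a dim-dimensional binary code can express."""
    return 2 ** dim


if __name__ == "__main__":
    codes, index = lfq_quantize([0.3, -1.2, 0.8, -0.1])
    print(codes, index)
    print(lfq_decode(index, 4))
    print(implicit_codebook_size(4))
```

Because the codebook is implicit, adding one latent dimension doubles the number of available tokens without adding any codebook parameters, which is one way to read the abstract's claim of expanded codebook capacity.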

Paper Structure (35 sections, 2 equations, 13 figures, 16 tables)

This paper contains 35 sections, 2 equations, 13 figures, 16 tables.

Figures (13)

  • Figure 1: Top: While existing models perform well on small-scale datasets like Motion-X and HumanML3D, they struggle with out-of-domain concepts on MotionLib, exhibiting limited generalization. Bottom: Curves showing the effects of scaling up large motion models. MotionLib is the first large T2M dataset comparable in scale to visual benchmarks like ImageNet.
  • Figure 2: Examples from MotionLib, which encompasses a diverse range of human motions from web videos. It features various scenes, ranging from outdoor environments to indoor settings, and includes both clean, single-person scenarios and crowded, multi-person scenes. MotionLib provides over 2.4M motion-text pairs in total. The full illustration with more examples can be seen in Figure \ref{fig:motionlib}.
  • Figure 3: Overview of our large motion model, Being-M0, whose training is divided into two stages. In the first stage (left), we pre-train a motion VQ-VAE to quantize motion sequences into tokens. In the second stage (right), we fine-tune an autoregressive language model to predict motion tokens.
  • Figure 4: Comparison of different motion quantization methods on Motion-X (left) and MotionLib (right). We only show MPJPE ($\downarrow$) results here due to space limitations, with FID results shown in Figure \ref{fig:app_motion_quant_FID}.
  • Figure 5: Illustration of examples in MotionLib. Each sample is a motion sequence extracted from an online video.
  • ...and 8 more figures