Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data

Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, Jingbo Wang

TL;DR

The paper tackles zero-shot motion generation by scaling both data and model capacity. It introduces MotionMillion, a 2,000+ hour, 2M-sequence motion dataset with rich text captions, and MotionMillion-Eval, a standardized zero-shot benchmark. A decoder-only transformer with FSQ-based motion tokens and wavelet preprocessing, built on LLAMA/T5-XL, demonstrates strong zero-shot generalization to out-of-domain and complex compositional motions at up to 7B parameters. The work advances data-driven pathways for zero-shot motion generation and provides a rigorous evaluation framework for future comparisons.

Abstract

Generating diverse and natural human motion sequences from textual descriptions is a fundamental and challenging research area within computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often struggle with zero-shot generalization, largely because training datasets are limited in size. Moreover, the lack of a comprehensive evaluation framework impedes progress on this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, that is, to achieve zero-shot generalization. To this end, we first develop an efficient annotation pipeline and introduce MotionMillion, the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.

Paper Structure

This paper contains 21 sections, 6 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: We present Go to Zero, which can handle out-of-domain and complex compositional motions.
  • Figure 2: Data Construction Pipeline of MotionMillion. We can obtain high-quality human motion from a monocular video via our six processing stages, including Shot Segmentation, Human Detection, Video Filtering, SMPL Motion Estimation, and Motion Filtering.
  • Figure 3: Overview of MotionMillion. This dataset exhibits extensive semantic and pose diversity, encompassing a broad spectrum of indoor and outdoor human motions.
  • Figure 4: Jerk comparison across MotionMillion, MotionX, and HumanML3D. MotionMillion exhibits the lowest jerk values, indicating that its motions are smoother.
  • Figure 5: Overview of our scalable model architecture, which uses FSQ as a motion tokenizer and an autoregressive transformer to generate motion from the given text.
  • ...and 6 more figures