
OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data

Bin Cao, Sipeng Zheng, Hao Luo, Boyuan Li, Jing Liu, Zongqing Lu

Abstract

Text-to-motion (T2M) generation aims to create realistic human movements from text descriptions, with promising applications in animation and robotics. Despite recent progress, current T2M models perform poorly on unseen text descriptions due to the small scale and limited diversity of existing motion datasets. To address this problem, we introduce OpenT2M, a million-level, high-quality, open-source motion dataset containing over 2800 hours of human motion. Each sequence undergoes rigorous quality control through physical feasibility validation and multi-granularity filtering, and is paired with detailed second-wise text annotations. We also develop an automated pipeline for creating long-horizon sequences, enabling complex motion generation. Building upon OpenT2M, we introduce MonoFrill, a pretrained motion model that achieves compelling T2M results without complicated designs or technical tricks ("frills"). Its core component is 2D-PRQ, a novel motion tokenizer that captures spatiotemporal dependencies by dividing the human body into biological parts. Experiments show that OpenT2M significantly improves the generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. We expect OpenT2M and MonoFrill to advance the T2M field by addressing longstanding data quality and benchmarking challenges.

Paper Structure (16 sections, 2 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: (Left) Visualization of text embeddings for the training and validation sets of HumanML3D and Motion-X. A substantial overlap between the splits indicates data leakage. To avoid this risk, we remove the overlap via data repartition (version denoted as $*$). (Right) However, we observe a drastic performance drop when experimenting on this repartitioned benchmark, which reveals the limited generalization capability of current methods when faced with out-of-domain data.
  • Figure 2: Data curation pipeline. (a) We adopt a two-stage pipeline consisting of physical feasibility validation and multi-granularity filtering. (b) We adapt the interpolation-based method for motion curation and introduce an RL policy for refinement. (c) For text annotation, we generate temporally aligned labels for each second of video and use them to synthesize a precise, semantically rich description.
  • Figure 3: Model overview. We propose an extendable, autoregressive (AR), discrete T2M model with no frills. (Left) Our core design, 2D-PRQ, divides the entire body into five parts, encoding and quantizing motion into a sequence of discrete part-level tokens. (Right) The AR model takes text as input and predicts part-level motion tokens. We call this model "MonoFrill" to reflect its simplicity.
  • Figure 4: Visualization of generated long-horizon motions. The results demonstrate the model's ability to generate long-horizon motion sequences that accurately align with complex texts.
  • Figure 5: Statistics of the OpenT2M dataset. (a) Motion sequence distribution (log scale). (b) Average motion length distribution.
  • ...and 3 more figures
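
The part-level tokenization described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' 2D-PRQ implementation: the joint grouping in `PARTS`, the random codebooks, the two quantization levels, and all function names are hypothetical, and the learned encoder and any temporal processing are omitted. It only shows the general idea of splitting a frame into five body parts and residually quantizing each part into discrete tokens.

```python
import random

# Hypothetical grouping of 11 joint features into five body parts (illustrative only).
PARTS = {
    "torso": [0, 1, 2],
    "left_arm": [3, 4],
    "right_arm": [5, 6],
    "left_leg": [7, 8],
    "right_leg": [9, 10],
}

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def residual_quantize(vec, codebooks):
    """Quantize vec with a stack of codebooks; each level encodes the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual

def make_codebooks(dim, levels=2, size=8, seed=0):
    """Random stand-in codebooks (assumption); a real tokenizer would learn these."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1, 1) / (lvl + 1) for _ in range(dim)] for _ in range(size)]
        for lvl in range(levels)
    ]

def tokenize_frame(frame, part_codebooks):
    """Map one frame (list of per-joint scalars) to {part name: [token per level]}."""
    tokens = {}
    for part, joints in PARTS.items():
        part_vec = [frame[j] for j in joints]
        tokens[part], _ = residual_quantize(part_vec, part_codebooks[part])
    return tokens

# Usage: one toy frame with 11 scalar joint features.
part_codebooks = {part: make_codebooks(len(joints)) for part, joints in PARTS.items()}
frame = [0.1 * j for j in range(11)]
tokens = tokenize_frame(frame, part_codebooks)
```

Each body part gets its own stack of codebooks, so a frame is represented by a small grid of tokens (parts by quantization levels) rather than a single whole-body code. An autoregressive model could then predict these tokens from text, as the caption describes.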