Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset

Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang

TL;DR

Motion-X addresses the scarcity of expressive, full-body motion data by introducing a scalable, automatic annotation pipeline that delivers precise SMPL-X 3D poses and rich text descriptions from massive multi-view and monocular videos. The dataset comprises 15.6M frames across 81.1K sequences, with 15.6M frame-level pose descriptions and 81.1K sequence-level semantic labels, spanning indoor and outdoor scenes and augmented with facial expressions via BAUM and IDEA400 coverage. Core technical contributions include a hierarchical 2D keypoint estimator, score-guided adaptive smoothing, learning-based 3D SMPL-X fitting, and global motion optimization, plus automatic generation of pose descriptions and semantic labels. Experimental results demonstrate improved annotation accuracy, enhanced text-driven whole-body motion generation, and tangible gains in whole-body mesh recovery, underscoring Motion-X’s potential to advance expressive motion synthesis and multi-modal understanding in real-world scenarios.

Abstract

In this paper, we present Motion-X, a large-scale 3D expressive whole-body motion dataset. Existing motion datasets predominantly contain body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions. Moreover, they are primarily collected from limited laboratory scenes with textual descriptions manually labeled, which greatly limits their scalability. To overcome these limitations, we develop a whole-body motion and text annotation pipeline, which can automatically annotate motion from either single- or multi-view videos and provide comprehensive semantic labels for each video and fine-grained whole-body pose descriptions for each frame. This pipeline is of high precision, cost-effective, and scalable for further research. Based on it, we construct Motion-X, which comprises 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering 81.1K motion sequences from massive scenes. Besides, Motion-X provides 15.6M frame-level whole-body pose descriptions and 81.1K sequence-level semantic labels. Comprehensive experiments demonstrate the accuracy of the annotation pipeline and the significant benefit of Motion-X in enhancing expressive, diverse, and natural motion generation, as well as 3D whole-body human mesh recovery.
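
To put the headline counts in perspective, the short sketch below derives the average sequence length implied by the figures in the abstract. The constants are the rounded values quoted above; the function name `mean_frames_per_sequence` is a hypothetical helper introduced only for this illustration and is not part of any Motion-X tooling.

```python
# Rounded counts as reported in the abstract.
TOTAL_FRAMES = 15.6e6      # 15.6M frame-level SMPL-X poses / pose descriptions
TOTAL_SEQUENCES = 81.1e3   # 81.1K motion sequences / semantic labels


def mean_frames_per_sequence(total_frames: float, total_sequences: float) -> float:
    """Return the average number of frames per sequence (hypothetical helper)."""
    if total_sequences <= 0:
        raise ValueError("total_sequences must be positive")
    return total_frames / total_sequences


avg_frames = mean_frames_per_sequence(TOTAL_FRAMES, TOTAL_SEQUENCES)
print(f"Average sequence length: about {avg_frames:.0f} frames")
```

Because the inputs are rounded, the result is only an estimate: roughly 192 frames per sequence, each frame carrying one pose annotation and one pose description, with one semantic label per sequence.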

Paper Structure

This paper contains 29 sections, 6 equations, 18 figures, 9 tables.

Figures (18)

  • Figure 1: Unlike (a) previous motion datasets such as HumanML3D and BABEL, (b) our dataset captures body, facial expressions, and hand gestures. We highlight the comparisons of facial expressions and hand gestures.
  • Figure 2: Statistics of sub-datasets. B, H, F are body, hand, and face. S and P are semantic and pose texts. P-GT is pseudo ground truth. * denotes videos collected by us.
  • Figure 3: Illustration of the overall data collection and annotation pipeline.
  • Figure 3: Evaluation of motion annotation pipeline on (a) 2D keypoints and (b) 3D SMPL-X datasets.
  • Figure 4: Overview of Motion-X. It includes: (a) diverse facial expressions extracted from BAUM, (b) indoor motion with expressive face and hand motions, (c) outdoor motion with diverse and challenging poses, and (d) several motion sequences. Purple SMPL-X is the observed frame, and the others are neighboring poses.
  • ...and 13 more figures