Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset
Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang
TL;DR
Motion-X addresses the scarcity of expressive whole-body motion data by introducing a scalable, automatic annotation pipeline that produces precise SMPL-X 3D poses and rich text descriptions from large collections of multi-view and monocular videos. The dataset comprises 15.6M frames across 81.1K sequences, with 15.6M frame-level pose descriptions and 81.1K sequence-level semantic labels. It spans indoor and outdoor scenes and adds facial expressions through its BAUM and IDEA400 coverage. The core technical contributions are a hierarchical 2D keypoint estimator, score-guided adaptive smoothing, learning-based 3D SMPL-X fitting, and global motion optimization, together with automatic generation of pose descriptions and semantic labels. Experiments show improved annotation accuracy, better text-driven whole-body motion generation, and gains in whole-body mesh recovery, underscoring Motion-X’s potential to advance expressive motion synthesis and multi-modal understanding in real-world scenarios.
Abstract
In this paper, we present Motion-X, a large-scale 3D expressive whole-body motion dataset. Existing motion datasets predominantly contain body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions. Moreover, they are primarily collected in limited laboratory scenes with manually labeled textual descriptions, which greatly limits their scalability. To overcome these limitations, we develop a whole-body motion and text annotation pipeline, which can automatically annotate motion from either single- or multi-view videos and provide comprehensive semantic labels for each video and fine-grained whole-body pose descriptions for each frame. This pipeline is precise, cost-effective, and scalable for further research. Based on it, we construct Motion-X, which comprises 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering 81.1K motion sequences from massive scenes. In addition, Motion-X provides 15.6M frame-level whole-body pose descriptions and 81.1K sequence-level semantic labels. Comprehensive experiments demonstrate the accuracy of the annotation pipeline and the significant benefit of Motion-X in enhancing expressive, diverse, and natural motion generation, as well as 3D whole-body human mesh recovery.
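To make the two annotation levels concrete, the following is a minimal sketch of how one sequence might be represented, assuming a simple in-memory layout. The class and field names (`FrameAnnotation`, `MotionSequence`, `smplx_params`, and so on) are hypothetical and are not the dataset's actual file format. The sketch also works out the average sequence length implied by the counts reported above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FrameAnnotation:
    """One frame: SMPL-X parameters plus a frame-level pose description."""
    smplx_params: List[float]  # hypothetical flat parameter vector
    pose_description: str      # fine-grained whole-body pose text


@dataclass
class MotionSequence:
    """One sequence: a sequence-level semantic label plus its frames."""
    semantic_label: str
    frames: List[FrameAnnotation] = field(default_factory=list)


def average_frames_per_sequence(total_frames: float, total_sequences: float) -> float:
    """Mean number of annotated frames per motion sequence."""
    return total_frames / total_sequences


# Counts as reported in the abstract (15.6M frames, 81.1K sequences).
TOTAL_FRAMES = 15.6e6
TOTAL_SEQUENCES = 81.1e3

avg_len = average_frames_per_sequence(TOTAL_FRAMES, TOTAL_SEQUENCES)
print(f"average frames per sequence: {avg_len:.0f}")
```

Under these counts, a sequence averages roughly 192 annotated frames, each paired with its own pose description, while the semantic label is attached once per sequence.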
