Motion-X++: A Large-Scale Multimodal 3D Whole-body Human Motion Dataset
Yuhong Zhang, Jing Lin, Ailing Zeng, Guanlin Wu, Shunlin Lu, Yurong Fu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang
TL;DR
Motion-X++ tackles the scarcity of expressive, multimodal 3D whole-body motion datasets by introducing an automatic, scalable annotation pipeline and a large-scale dataset that combines 3D SMPL-X motions with frame- and sequence-level text, audio, and semantic labels. The dataset comprises 19.5M frame-level pose descriptions across 120.5K sequences and 120.5K sequence-level semantic labels, drawn from diverse indoor and outdoor scenes, with enhanced capture of facial expressions and hand gestures. Comprehensive experiments validate the annotation pipeline and show that Motion-X++ improves text-driven and audio-driven motion generation, mesh recovery, and 2D pose estimation compared to prior datasets. The work also discusses limitations of markerless capture and outlines future directions for multimodal pre-training and large-scale motion priors.
Abstract
In this paper, we introduce Motion-X++, a large-scale multimodal 3D expressive whole-body human motion dataset. Existing motion datasets predominantly capture body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions, and are typically limited to lab settings with manually labeled text descriptions, which restricts their scalability. To address this issue, we develop a scalable annotation pipeline that can automatically capture 3D whole-body human motion and comprehensive textual labels from RGB videos, and we use it to build the Motion-X dataset comprising 81.1K text-motion pairs. Furthermore, we extend Motion-X into Motion-X++ by improving the annotation pipeline, introducing more data modalities, and scaling up the data quantity. Motion-X++ provides 19.5M 3D whole-body pose annotations covering 120.5K motion sequences from a wide range of scenes, 80.8K RGB videos, 45.3K audio recordings, 19.5M frame-level whole-body pose descriptions, and 120.5K sequence-level semantic labels. Comprehensive experiments validate the accuracy of our annotation pipeline and highlight Motion-X++'s significant benefits for generating expressive, precise, and natural motion, with paired multimodal labels supporting several downstream tasks, including text-driven whole-body motion generation, audio-driven motion generation, 3D whole-body human mesh recovery, and 2D whole-body keypoint estimation.
