Motion-X++: A Large-Scale Multimodal 3D Whole-body Human Motion Dataset

Yuhong Zhang, Jing Lin, Ailing Zeng, Guanlin Wu, Shunlin Lu, Yurong Fu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang

TL;DR

Motion-X++ tackles the scarcity of expressive, multimodal 3D whole-body motion datasets by introducing an automatic, scalable annotation pipeline and a large-scale dataset that combines 3D SMPL-X motions with frame- and sequence-level text, audio, and semantic labels. The dataset comprises 19.5M frame-level pose descriptions over 120.5K sequences and 120.5K sequence-level semantics, drawn from diverse indoor and outdoor scenes, with enhanced capture of facial expressions and hand gestures. Comprehensive experiments validate the annotation pipeline and show that Motion-X++ improves text-driven and audio-driven motion generation, mesh recovery, and 2D pose estimation compared to prior datasets. The work also discusses limitations of markerless capture and outlines future directions for multimodal pre-training and large-scale motion priors.

Abstract

In this paper, we introduce Motion-X++, a large-scale multimodal 3D expressive whole-body human motion dataset. Existing motion datasets predominantly capture body-only poses, lack facial expressions, hand gestures, and fine-grained pose descriptions, and are typically limited to lab settings with manually labeled text descriptions, which restricts their scalability. To address this issue, we develop a scalable annotation pipeline that automatically captures 3D whole-body human motion and comprehensive textual labels from RGB videos, and we use it to build the Motion-X dataset, comprising 81.1K text-motion pairs. Furthermore, we extend Motion-X into Motion-X++ by improving the annotation pipeline, introducing more data modalities, and scaling up the amount of data. Motion-X++ provides 19.5M 3D whole-body pose annotations covering 120.5K motion sequences from a wide range of scenes, 80.8K RGB videos, 45.3K audios, 19.5M frame-level whole-body pose descriptions, and 120.5K sequence-level semantic labels. Comprehensive experiments validate the accuracy of our annotation pipeline and highlight Motion-X++'s significant benefits for generating expressive, precise, and natural motion. Its paired multimodal labels support several downstream tasks, including text-driven whole-body motion generation, audio-driven motion generation, 3D whole-body human mesh recovery, and 2D whole-body keypoint estimation.

Paper Structure (22 sections, 15 equations, 22 figures, 11 tables)

Figures (22)

  • Figure 1: Compared to Motion-X, our enhanced dataset Motion-X++ offers (a) more precise human motion, including robust facial expressions and refined hand gestures. Facial expressions and hand gestures are highlighted. Additionally, Motion-X++ provides a broader range of modalities, such as audio and video, and improved quality in text (annotated by GPT-4V) and motion (refined annotation pipeline). The expanded modalities enable Motion-X++ to support additional downstream tasks, including video generation, whole-body pose estimation, and audio-driven motion generation, beyond mesh recovery and text-driven motion generation supported by Motion-X. (c) illustrates a comparison between Motion-X and Motion-X++, demonstrating more expressive language captions and more precise hand gestures provided by Motion-X++.
  • Figure 2: Diversity statistics of the face, hand, and body motions in Motion-X++.
  • Figure 3: Statistics of sub-datasets. B, H, and F denote body, hand, and face. S and P denote semantic and pose texts. P-GT is pseudo ground truth. * denotes that the videos were collected by us.
  • Figure 3: Illustration of the overall data collection and annotation pipeline.
  • Figure 4: Overview of Motion-X++. It includes (a) diverse facial expressions extracted from BAUM, (b) indoor motion with expressive face and hand motions, (c) outdoor motion with diverse and challenging poses, and (d) several motion sequences. Purple SMPL-X is the observed frame, and the others are neighboring poses.
  • ...and 17 more figures