RoleMotion: A Large-Scale Dataset towards Robust Scene-Specific Role-Playing Motion Synthesis with Fine-grained Descriptions

Junran Peng, Yiheng Huang, Silei Shen, Zeji Wei, Jingwei Yang, Baojie Wang, Yonghao He, Chuanchen Luo, Man Zhang, Xucheng Yin, Wei Sui

TL;DR

RoleMotion tackles a core bottleneck in text-to-motion research by offering a large-scale, scene-specific motion dataset captured from scratch with fine-grained textual descriptions. It pairs the data with a strong transformer-based evaluator and a new benchmark to align text prompts with body-and-hand motions, enabling rigorous comparison across methods. The study demonstrates that two-stage body-and-hand generation can improve textual alignment and motion quality, and provides qualitative and quantitative evidence that RoleMotion outperforms existing datasets in realism and controllability. This work delivers practical resources and evaluation tools to advance robust scene-specific motion synthesis for NPCs in virtual environments.

Abstract

In this paper, we introduce RoleMotion, a large-scale human motion dataset that encompasses a wealth of role-playing and functional motion data tailored to specific scenes. Existing text-to-motion datasets are mostly assembled in a decentralized way, as amalgamations of assorted subsets, so their data are nonfunctional and too isolated to work together to cover social activities in various scenes. In addition, the quality of their motion data is inconsistent, and their textual annotations lack fine-grained detail. In contrast, RoleMotion is meticulously designed and collected with a particular focus on scenes and roles. The dataset features 25 classic scenes, 110 functional roles, over 500 behaviors, and 10296 high-quality human motion sequences of body and hands, annotated with 27831 fine-grained text descriptions. We build an evaluator stronger than existing counterparts, prove its reliability, and evaluate various text-to-motion methods on our dataset. Finally, we explore the interplay between body and hand motion generation. Experimental results demonstrate the high quality and functionality of our dataset for text-driven whole-body generation.

Paper Structure

This paper contains 22 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The design structure of RoleMotion. RoleMotion is a large-scale, scene-specific, role-playing human motion dataset of high quality compared to existing text2motion datasets. It aims to cover complete social activities in various scenes. All motions are carefully performed by actors and captured with MoCap devices rather than amassed from existing datasets. The text annotations are fine-grained: body state, side of the body part, and direction must be specified clearly.
  • Figure 2: The data quality of HumanML3D raises significant concerns. More cases could be found in the supplementary material.
  • Figure 3: Qualitative results of models trained on HumanML3D and RoleMotion. Compared to HumanML3D, the motion data in RoleMotion is of high quality, and the textual annotations are fine-grained: body state, side of the body part, and direction must be specified. The model trained on RoleMotion accurately completes almost all of the action details specified in the text; the exception is the top-right case, where both models fail to correctly perform "tilting head to the left". Motions from the model trained on HumanML3D tend to be unnatural and contain many detail-level errors. All the action categories shown in the figure are included in both datasets.
  • Figure 4: The pipeline of two-stage body-and-hand motion generation.
  • Figure 5: Reconstruction results of the VQVAE trained on RoleMotion.