FG-MDM: Towards Zero-Shot Human Motion Generation via ChatGPT-Refined Descriptions
Xu Shi, Wei Yao, Chuanchen Luo, Junran Peng, Hongwen Zhang, Yunlian Sun
TL;DR
This work tackles zero-shot text-to-motion generation by converting vague textual descriptions into fine-grained, per-body-part annotations produced with a ChatGPT-based prompt strategy. It then trains a transformer diffusion model that uses a global CLIP token plus per-part CLIP tokens to condition the denoising process, enabling motions that extend beyond the training data distribution. Quantitative and qualitative results on HumanML3D, KIT, HuMMan, and Kungfu demonstrate competitive zero-shot performance and strong generalization, supported by ablations and a user study. The approach offers practical gains in flexibility and realism for text-driven motion generation and provides publicly available fine-grained annotations to spur further research.
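The conditioning layout described above (one global text token plus one token per body part) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_text_embedding` is a deterministic stand-in and not a real CLIP encoder, the body-part names are hypothetical, and prepending the conditioning tokens to the noisy motion frames is one plausible arrangement for a transformer denoiser, not necessarily the one FG-MDM uses.

```python
import hashlib
from typing import Dict, List

EMBED_DIM = 8  # toy width chosen for readability; real CLIP text embeddings are much wider


def toy_text_embedding(text: str, dim: int = EMBED_DIM) -> List[float]:
    """Deterministic placeholder for a text encoder (NOT real CLIP)."""
    if not 0 < dim <= 32:
        raise ValueError("dim must be in 1..32 for this toy encoder")
    digest = hashlib.sha256(text.encode("utf-8")).digest()  # 32 bytes
    return [digest[i] / 255.0 for i in range(dim)]


def build_condition_tokens(global_text: str, part_texts: Dict[str, str]) -> List[List[float]]:
    """Return [global token, part token 1, ..., part token K]."""
    tokens = [toy_text_embedding(global_text)]
    for part in sorted(part_texts):  # fixed order keeps the layout reproducible
        tokens.append(toy_text_embedding(f"{part}: {part_texts[part]}"))
    return tokens


def build_denoiser_input(
    cond_tokens: List[List[float]], noisy_frames: List[List[float]]
) -> List[List[float]]:
    """Assumed layout: conditioning tokens are prepended to the noisy motion frames."""
    return cond_tokens + noisy_frames


# Hypothetical per-part descriptions for the prompt "a person waves".
part_texts = {
    "arms": "the right arm lifts and swings side to side",
    "legs": "both legs stay still",
}
cond = build_condition_tokens("a person waves", part_texts)
frames = [[0.0] * EMBED_DIM for _ in range(4)]  # four placeholder noisy motion frames
seq = build_denoiser_input(cond, frames)
```

In this toy layout the denoiser would see seven tokens: one global token, two part tokens, and four motion frames. Swapping the placeholder encoder for a real text encoder would change only `toy_text_embedding`.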
Abstract
Recently, significant progress has been made in text-based motion generation, enabling the generation of diverse and high-quality human motions that conform to textual descriptions. However, zero-shot generation, i.e., generating motions beyond the distribution of the original datasets, remains challenging. Adopting a divide-and-conquer strategy, we propose a new framework named Fine-Grained Human Motion Diffusion Model (FG-MDM) for zero-shot human motion generation. Specifically, we first parse previous vague textual annotations into fine-grained descriptions of different body parts by leveraging a large language model. We then use these fine-grained descriptions to guide a transformer-based diffusion model, which further adopts a design of part tokens. Because these descriptions are closer to the essence of the motion, FG-MDM can generate human motions beyond the scope of the original datasets. Our experimental results demonstrate the superiority of FG-MDM over previous methods in zero-shot settings. We will release our fine-grained textual annotations for HumanML3D and KIT.
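As a rough illustration of the parsing step, the sketch below builds a prompt that asks a language model for one description per body part and then parses a reply into a dictionary. The body-part list, the prompt wording, and the `<part>: <description>` reply format are all assumptions made for this example, not the prompts used by the authors; the sample reply is hand-written, and no language model is called.

```python
from typing import Dict, Sequence

# Hypothetical body-part vocabulary; the paper's actual partition may differ.
BODY_PARTS: Sequence[str] = ("head", "arms", "torso", "legs")


def build_prompt(vague_description: str, parts: Sequence[str] = BODY_PARTS) -> str:
    """Compose an instruction asking for one line per body part."""
    lines = [
        "Describe how each body part moves for the following motion.",
        f"Motion: {vague_description}",
        "Answer with one line per body part in the form '<part>: <description>'.",
        "Body parts: " + ", ".join(parts),
    ]
    return "\n".join(lines)


def parse_response(response: str, parts: Sequence[str] = BODY_PARTS) -> Dict[str, str]:
    """Keep only well-formed '<part>: <description>' lines for known parts."""
    result: Dict[str, str] = {}
    for line in response.splitlines():
        name, sep, desc = line.partition(":")
        key = name.strip().lower()
        if sep and key in parts and desc.strip():
            result[key] = desc.strip()
    return result


prompt = build_prompt("a person waves")

# Hand-written stand-in for a model reply, used only to exercise the parser.
sample_reply = (
    "Head: faces forward\n"
    "Arms: the right arm lifts and swings side to side\n"
    "Sure, here is more detail\n"
    "Legs: both legs stay still"
)
fine_grained = parse_response(sample_reply)
```

Lines that do not match a known body part are dropped rather than guessed at, so a malformed reply yields a smaller dictionary instead of a wrong annotation.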
