FG-MDM: Towards Zero-Shot Human Motion Generation via ChatGPT-Refined Descriptions

Xu Shi, Wei Yao, Chuanchen Luo, Junran Peng, Hongwen Zhang, Yunlian Sun

TL;DR

This work tackles zero-shot text-to-motion generation by converting vague textual descriptions into fine-grained, per-body-part annotations produced with a ChatGPT-based prompt strategy. It then trains a transformer-based diffusion model that conditions the denoising process on a global CLIP token plus per-part CLIP tokens, enabling motions that extend beyond the training data distribution. Quantitative and qualitative results on HumanML3D, KIT, HuMMan, and Kungfu show competitive zero-shot performance and strong generalization, supported by ablations and a user study. The approach improves flexibility and realism for text-driven motion generation, and the authors plan to release their fine-grained annotations for HumanML3D and KIT to spur further research.

Abstract

Recently, significant progress has been made in text-based motion generation, enabling the generation of diverse and high-quality human motions that conform to textual descriptions. However, zero-shot generation, i.e., generating motions beyond the distribution of the original datasets, remains challenging. By adopting a divide-and-conquer strategy, we propose a new framework named Fine-Grained Human Motion Diffusion Model (FG-MDM) for zero-shot human motion generation. Specifically, we first parse previous vague textual annotations into fine-grained descriptions of different body parts by leveraging a large language model. We then use these fine-grained descriptions to guide a transformer-based diffusion model, which further adopts a design of part tokens. FG-MDM can generate human motions beyond the scope of the original datasets owing to descriptions that are closer to the essence of motion. Our experimental results demonstrate the superiority of FG-MDM over previous methods in zero-shot settings. We will release our fine-grained textual annotations for HumanML3D and KIT.

Paper Structure

This paper contains 19 sections, 5 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: FG-MDM can generate high-quality human motions in zero-shot settings by using fine-grained descriptions of different body parts.
  • Figure 2: The overall pipeline of FG-MDM. The model learns the denoising process of the diffusion model from the motion $x_{t}^{1:n}$ at time step $t$ to the clean motion $\hat{x}_{0}^{1:n}$, given the text condition. The input text is first paraphrased by ChatGPT into fine-grained descriptions $D_{1:k}$ for different parts of the body, where $k$ denotes the number of body parts. These descriptions are then fed into a pre-trained CLIP text encoder and projected, along with the time step $t$, onto input tokens $PT_{1:k}$ of the transformer. The overall fine-grained text is further encoded into a global input token $GL$, providing holistic information. In the sampling process of the diffusion model, an initial random noise $x_{T}^{1:n}$ is sampled, and then $T$ iterations are performed to generate the clean motion $\hat{x}_{0}^{1:n}$. At each sampling step $t$, guided by $PT_{1:k}$ and $GL$, the transformer encoder predicts the clean motion $\hat{x}_{0}^{1:n}$, which is then noised back to $x_{t-1}^{1:n}$.
  • Figure 3: Qualitative results with unseen motions. We compare our FG-MDM with MDM [tevet2023human] and MLD [chen2023executing]. All three models are trained on HumanML3D. For better visualization, some pose frames are shifted to prevent overlap. Please refer to supplementary materials for more video demos.
  • Figure 4: Qualitative results with unseen stylized motions. All three models are trained on HumanML3D. Please refer to supplementary materials for more video demos.
  • Figure 5: User study results. For each method, a color bar ranging from blue to red represents the percentage of text-to-motion match levels, with blue indicating the least match and red indicating the most match.
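
The sampling procedure described in the Figure 2 caption (start from noise $x_{T}^{1:n}$, predict the clean motion $\hat{x}_{0}^{1:n}$ at every step, then noise it back to $x_{t-1}^{1:n}$) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the transformer encoder, the part tokens and global token are placeholder arrays rather than projected CLIP embeddings, and the linear noise schedule and all sizes (`k`, `n_frames`, `feat_dim`, `T`) are arbitrary choices made for the example.

```python
import numpy as np

def make_alphas_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal rates for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alphas_bar, rng):
    """Noise a clean motion x0 to step t (1-indexed); t == 0 returns x0."""
    if t == 0:
        return x0
    a = alphas_bar[t - 1]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

def toy_denoiser(x_t, t, part_tokens, global_token):
    """Hypothetical stand-in for the transformer encoder.

    It mixes the noisy motion with a summary of the conditioning tokens
    so the example runs end to end; a real model would be learned.
    """
    cond = global_token + part_tokens.mean(axis=0)  # shape (feat_dim,)
    return 0.5 * x_t + 0.5 * cond[None, :]          # shape (n_frames, feat_dim)

def sample(denoiser, part_tokens, global_token, n_frames, feat_dim, T, seed=0):
    """Predict-x0 sampling loop: T iterations from x_T down to x_0."""
    rng = np.random.default_rng(seed)
    alphas_bar = make_alphas_bar(T)
    x_t = rng.standard_normal((n_frames, feat_dim))  # initial noise x_T
    for t in range(T, 0, -1):
        # Predict the clean motion, guided by part tokens and global token.
        x0_hat = denoiser(x_t, t, part_tokens, global_token)
        # Noise the prediction back to step t-1 (no noise at the last step).
        x_t = q_sample(x0_hat, t - 1, alphas_bar, rng)
    return x_t

if __name__ == "__main__":
    k, n_frames, feat_dim, T = 4, 5, 8, 10           # arbitrary example sizes
    rng = np.random.default_rng(42)
    part_tokens = rng.standard_normal((k, feat_dim))  # placeholder for PT_{1:k}
    global_token = rng.standard_normal(feat_dim)      # placeholder for GL
    motion = sample(toy_denoiser, part_tokens, global_token,
                    n_frames, feat_dim, T)
    print(motion.shape)
```

Predicting $\hat{x}_{0}$ directly and re-noising it keeps each step's output in the motion space itself; in this sketch the final iteration adds no noise, so the returned array is the last clean-motion prediction.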