Motion Generation from Fine-grained Textual Descriptions

Kunhang Li, Yansong Feng

TL;DR

We design FineMotionDiffuse, a new text2motion model that makes full use of fine-grained textual information. It outperforms MotionDiffuse in generating spatially or chronologically composite motions by learning the implicit mappings from fine-grained descriptions to the corresponding basic motions.

Abstract

The task of text2motion is to generate human motion sequences from given textual descriptions, where the model explores diverse mappings from natural language instructions to human body movements. While most existing works are confined to coarse-grained motion descriptions, e.g., "A man squats.", fine-grained descriptions specifying movements of relevant body parts are barely explored. Models trained with coarse-grained texts may not be able to learn mappings from fine-grained motion-related words to motion primitives, and consequently fail to generate motions from unseen descriptions. In this paper, we build FineHumanML3D, a large-scale language-motion dataset specializing in fine-grained textual descriptions, by feeding GPT-3.5-turbo step-by-step instructions with compulsory pseudo-code checks. Accordingly, we design a new text2motion model, FineMotionDiffuse, that makes full use of fine-grained textual information. Our quantitative evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38 compared with competitive baselines. According to the qualitative evaluation and case study, our model outperforms MotionDiffuse in generating spatially or chronologically composite motions by learning the implicit mappings from fine-grained descriptions to the corresponding basic motions. We release our data at https://github.com/KunhangL/finemotiondiffuse.

Paper Structure (32 sections, 2 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: An example motion sequence with a coarse-grained description and its fine-grained version.
  • Figure 2: A fine-grained description along with its pseudo-codes acquired through P8.
  • Figure 3: An overview of our FineMotionDiffuse model. In the diffusion block (right), blue lines indicate the training flow and brown lines indicate the inference flow.
  • Figure 4: Example motions with spatial compositionality. C denotes inputting coarse-grained texts, F denotes inputting fine-grained texts, and C+F denotes inputting both texts. Due to space limits, we only display coarse-grained descriptions here. Fine-grained descriptions are shown in the appendix.
  • Figure 5: Examples with chronological compositionality. Fine-grained descriptions are in the appendix.