Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation

Yin Wang, Mu Li, Jiapeng Liu, Zhiying Leng, Frederick W. B. Li, Ziyao Zhang, Xiaohui Liang

TL;DR

Fg-T2M++ targets fine-grained text-driven motion generation with a diffusion-based generator built from three modules. The LLMs Semantic Parsing module decomposes the prompt into part-level action descriptions and word semantics. The Hyperbolic Text Representation module uses dependency trees and hyperbolic graph convolution to preserve the hierarchical structure of the text. The Multi-Modal Fusion module integrates text and motion features in a coarse-to-fine manner. Experiments on HumanML3D and KIT-ML report state-of-the-art results on quantitative metrics (R-TOP, FID, MM-Dist) and on qualitative fidelity, including for long, complex prompts. By aligning natural language more closely with body kinematics, the method supports more realistic and controllable motion synthesis for animation, AR/VR, and interactive systems.

Abstract

We address the challenging problem of fine-grained text-driven human motion generation. Existing works generate imprecise motions that fail to accurately capture relationships specified in text due to: (1) lack of effective text parsing for detailed semantic cues regarding body parts, (2) not fully modeling linguistic structures between words to comprehend text comprehensively. To tackle these limitations, we propose a novel fine-grained framework Fg-T2M++ that consists of: (1) an LLMs semantic parsing module to extract body part descriptions and semantics from text, (2) a hyperbolic text representation module to encode relational information between text units by embedding the syntactic dependency graph into hyperbolic space, and (3) a multi-modal fusion module to hierarchically fuse text and motion features. Extensive experiments on HumanML3D and KIT-ML datasets demonstrate that Fg-T2M++ outperforms SOTA methods, validating its ability to accurately generate motions adhering to comprehensive text semantics.
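As an illustration of what embedding features into hyperbolic space can involve, the following is a minimal sketch assuming the Poincaré ball model with curvature -c. The abstract does not say which hyperbolic model or curvature Fg-T2M++ uses, so the model choice and the function names `expmap0` and `poincare_distance` are hypothetical and are not taken from the paper.

```python
import math


def expmap0(v, c=1.0):
    """Map a Euclidean (tangent) vector at the origin onto the Poincare ball.

    Assumption: Poincare ball of curvature -c; the source does not specify this.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points strictly inside the ball."""
    def sq_norm(a):
        return sum(t * t for t in a)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)


# A Euclidean feature of norm 0.5 lands inside the unit ball.
point = expmap0([0.3, 0.4])
origin = [0.0, 0.0]
dist_to_origin = poincare_distance(origin, point)
```

Distances in this model grow rapidly toward the boundary of the ball, which is why hyperbolic embeddings are often used for tree-like data such as syntactic dependency structures.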

Paper Structure

This paper contains 42 sections, 14 equations, 24 figures, 6 tables.

Figures (24)

  • Figure 1: Our Fg-T2M++ excels in generating high-quality and diverse motion sequences, capturing fine-grained details embedded in the text prompts.
  • Figure 2: Overview of Fg-T2M++: Given a text prompt $c$, the reverse denoising process of the diffusion model starts from noisy motion data $X_T$ and produces clean motion data $X_0$. Initially, the text prompt undergoes LLMs semantic parsing to generate LLMs-parsed fine-grained descriptions. Then, both the text prompt and its parsed descriptions are input into the hyperbolic text representation module, which captures precise representations of text features. Finally, the noisy motion data $X_t$, along with the two fine-grained text features, are fed into the multi-modal fusion module to obtain the clean motion data $X_{t-1}$.
  • Figure 3: The prompting strategy and an example for LLMs Semantic Parsing.
  • Figure 4: Architecture of HTP. a) construction of the text-tree structure by dependency analysis; b) hyperbolic graph convolution in hyperbolic space to capture precise text features; and c) the cross-perception module, which makes full use of the LLMs-parsed fine-grained descriptions.
  • Figure 5: Illustration of two fusion methods in MMF. a) multi-modal sentence-level feature fusion and b) multi-modal word-level feature fusion.
  • ...and 19 more figures
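
The text-tree construction mentioned in the Figure 4 caption can be sketched as turning a dependency parse into an adjacency matrix that a graph convolution could consume. This is a rough sketch under stated assumptions: the head indices below are a hand-written, hypothetical parse rather than the output of any parser, and the self-loops and row normalization follow common graph-convolution practice, not a detail confirmed by the source.

```python
def dependency_adjacency(heads):
    """Build a symmetric adjacency matrix with self-loops from head indices.

    heads[i] is the index of token i's syntactic head, or -1 for the root.
    """
    n = len(heads)
    adj = [[0.0] * n for _ in range(n)]
    for child, head in enumerate(heads):
        adj[child][child] = 1.0  # self-loop (assumed, as in common GCN setups)
        if head >= 0:
            adj[child][head] = 1.0
            adj[head][child] = 1.0
    return adj


def row_normalize(adj):
    """Scale each row to sum to 1 so neighbor features are averaged."""
    return [[value / sum(row) for value in row] for row in adj]


# Hypothetical parse of "a person walks forward":
# "a" -> "person", "person" -> "walks", "walks" is the root, "forward" -> "walks".
tokens = ["a", "person", "walks", "forward"]
heads = [1, 2, -1, 2]
norm_adj = row_normalize(dependency_adjacency(heads))
```

Each row of `norm_adj` weights a token's own feature and those of its syntactic neighbors equally; a hyperbolic graph convolution would apply an aggregation of this kind after mapping the features into hyperbolic space.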