ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis

KunHo Heo, SuYeon Kim, Yonghyun Gwon, Youngbin Kim, MyeongAh Cho

TL;DR

This work proposes ParTY, a novel framework that enhances part expressiveness while generating coherent full-body motions and achieves substantial improvements over previous methods.

Abstract

Text-to-motion synthesis aims to generate natural and expressive human motions from textual descriptions. While existing approaches primarily focus on generating holistic motions from text descriptions, they struggle to accurately reflect actions involving specific body parts. Recent part-wise motion generation methods attempt to resolve this but face two critical limitations: (i) they lack explicit mechanisms for aligning textual semantics with individual body parts, and (ii) they often generate incoherent full-body motions due to integrating independently generated part motions. To overcome these issues and resolve the fundamental trade-off in existing methods, we propose ParTY, a novel framework that enhances part expressiveness while generating coherent full-body motions. ParTY comprises: (1) Part-Guided Network, which first generates part motions to obtain part guidance, then uses it to generate holistic motions; (2) Part-aware Text Grounding, which diversely transforms text embeddings and appropriately aligns them with each body part; and (3) Holistic-Part Fusion, which adaptively fuses holistic motions and part motions. Extensive experiments, including part-level and coherence-level evaluations, demonstrate that ParTY achieves substantial improvements over previous methods.
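The idea of adaptively fusing holistic and part motions can be illustrated with a small cross-attention sketch. This is only an assumption-laden illustration and not the authors' implementation: the abstract does not specify how Holistic-Part Fusion is computed. The function name `fuse_holistic_with_parts`, the tensor shapes, the use of plain dot-product scores without learned projections, and the residual addition are all hypothetical choices made for the example. Each holistic motion token acts as a query over the part tokens (here two parts, arms and legs) at the same frame.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_holistic_with_parts(holistic, parts):
    """Hypothetical fusion step (an assumption, not the paper's exact method).

    holistic: (T, D) array, one holistic motion token per frame.
    parts:    (P, T, D) array, one token per body part and frame.
    Returns the fused tokens (T, D) and attention weights (P, T).
    """
    d = holistic.shape[-1]
    # Scaled dot-product score between the holistic token and each part token
    # at the same frame: scores[p, t] = <holistic[t], parts[p, t]> / sqrt(d).
    scores = np.einsum("td,ptd->pt", holistic, parts) / np.sqrt(d)
    # Normalize across parts so the weights at every frame sum to 1.
    weights = softmax(scores, axis=0)
    # Weighted sum of part tokens per frame.
    part_context = np.einsum("pt,ptd->td", weights, parts)
    # Residual connection keeps the holistic token as the base signal.
    fused = holistic + part_context
    return fused, weights


rng = np.random.default_rng(0)
T, D = 6, 8                           # frames and token width (arbitrary)
holistic = rng.normal(size=(T, D))
parts = rng.normal(size=(2, T, D))    # index 0: arms, index 1: legs (assumed order)
fused, weights = fuse_holistic_with_parts(holistic, parts)
```

The `weights` array has one row per body part and one column per frame, so at each frame it shows how strongly the holistic token draws on each part under these assumptions.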

Paper Structure (45 sections, 21 equations, 15 figures, 24 tables)

This paper contains 45 sections, 21 equations, 15 figures, 24 tables.

Figures (15)

  • Figure 1: (a) Holistic methods maintain coherence well but show limited part-text alignment. In contrast, (b) part-wise methods show enhanced part-text alignment (e.g., correctly performing the left leg lunge) but compromise coherence as a trade-off (e.g., neck distortion and misaligned arm and leg movements). (c) Our ParTY resolves this trade-off by achieving superior performance in both part-text alignment and coherence.
  • Figure 2: Architecture of the Temporal-aware VQ-VAE. Part VQ-VAE follows an identical architecture, where the sole distinction lies in processing part-level rather than full-body motion data.
  • Figure 3: Overview of ParTY. Text embeddings are processed through Part-aware Text Grounding, then part transformers generate Part Guidance for the holistic transformer to generate motion tokens, with Holistic-Part Fusion applied during generation. The notation {Part} indicates that the process is performed for both arms and legs.
  • Figure 4: Qualitative comparison on HumanML3D. Colored text in the descriptions corresponds to the colored body parts in the generated motions, with coherence-level (TC, SC) scores displayed for each sample.
  • Figure 5: Visualization of the cross-attention map of HPF. Rows correspond to body parts and columns represent temporal frames. We visualize the normalized attention weights between the holistic motion token and each part motion token.
  • ...and 10 more figures