ParCo: Part-Coordinating Text-to-Motion Synthesis

Qiran Zou, Shangyuan Yuan, Shian Du, Yu Wang, Chang Liu, Yi Xu, Jie Chen, Xiangyang Ji

TL;DR

ParCo addresses the challenge of text-to-motion synthesis by enforcing fine-grained part awareness and cross-part coordination. It introduces a two-stage pipeline: first discretizing whole-body motion into six part motions with six VQ-VAEs to form priors, then employing six Part-Coordinated Transformers coupled with a central coordination module to generate coordinated part motions. The method achieves superior or competitive results on HumanML3D and KIT-ML benchmarks while reducing computation and parameters relative to prior approaches, demonstrating stronger text-motion alignment and precise part control. Ablation and robustness analyses corroborate the benefits of six-part partitioning and explicit inter-part communication, with potential for hierarchical extension and broader applicability.

Abstract

We study a challenging task: text-to-motion synthesis, which aims to generate motions that align with textual descriptions and exhibit coordinated movements. Current part-based methods introduce body-part partitioning into the motion synthesis process to achieve finer-grained generation. However, these methods face two problems: different part motions lack coordination, and networks have difficulty understanding part concepts. Moreover, introducing finer-grained part concepts increases computational complexity. In this paper, we propose Part-Coordinating Text-to-Motion Synthesis (ParCo), which has enhanced capabilities for understanding part motions and for communication among different part motion generators, ensuring coordinated and fine-grained motion synthesis. Specifically, we discretize whole-body motion into multiple part motions to establish the prior concept of different parts. Afterward, we employ multiple lightweight generators designed to synthesize different part motions and coordinate them through our part coordination module. Our approach demonstrates superior performance on common benchmarks, including HumanML3D and KIT-ML, at low computational cost, providing substantial evidence of its effectiveness. Code is available at https://github.com/qrzou/ParCo .
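The discretization step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the learned VQ-VAE encoder and decoder are omitted and raw features are quantized directly, the equal-width part slices are a hypothetical stand-in for the real joint-to-part mapping, and the function names (`split_into_parts`, `nearest_code`, `quantize_motion`) are invented for illustration. Only the idea of one codebook per part, producing one code index sequence per part, comes from the text.

```python
NUM_PARTS = 6  # the paper uses six parts; the slicing below is hypothetical


def split_into_parts(frame, num_parts=NUM_PARTS):
    """Split one whole-body feature vector into contiguous, equal-width slices.

    Assumption: equal-width slices stand in for the real joint-to-part mapping.
    """
    width = len(frame) // num_parts
    return [frame[i * width:(i + 1) * width] for i in range(num_parts)]


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))

    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def quantize_motion(motion, codebooks):
    """Map a motion (a list of frames) to one code index sequence per part.

    Each part has its own codebook, mirroring the one-quantizer-per-part idea.
    """
    sequences = [[] for _ in codebooks]
    for frame in motion:
        parts = split_into_parts(frame, len(codebooks))
        for p, part_vec in enumerate(parts):
            sequences[p].append(nearest_code(part_vec, codebooks[p]))
    return sequences


if __name__ == "__main__":
    # Toy data: six identical two-entry codebooks over 2-D part features.
    toy_codebooks = [[[0.0, 0.0], [1.0, 1.0]] for _ in range(NUM_PARTS)]
    toy_frame = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.1, 0.1, 0.9, 0.8]
    print(quantize_motion([toy_frame, toy_frame], toy_codebooks))
```

In a trained system the codebooks would be learned jointly with the encoder and decoder; here they are fixed by hand so the nearest-neighbor lookup can be followed step by step.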

Paper Structure (29 sections, 5 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Our ParCo is capable of coordinating the motion of various body parts to produce realistic and accurate motion.
  • Figure 2: Conceptual comparison of three part-based synthesis methods. (a): One generator synthesizes the whole-body embedding, which contains information about different parts internally. (b): Two separate generators synthesize the upper and lower body's motions independently, without information exchange between them. (c): Our ParCo employs multiple lightweight generators designed to synthesize different part motions, which are coordinated by the Part Coordination module.
  • Figure 3: Pipeline of ParCo. ParCo consists of two stages: (a) The whole-body motion is discretized into 6 part motions, which are encoded into 6 quantized code index sequences by 6 VQ-VAEs (encoder and quantizer). This stage gives the second stage a prior on the concept of part motions. (b) We use the quantized index sequences and the corresponding textual description to train 6 transformers for part motion generation. During generation, these transformers are coordinated by our Part Coordination module. The generated part motion codes are decoded by the VQ-VAE decoders to reconstruct the 6 part motions, which are then integrated into the final whole-body motion.
  • Figure 4: The architecture of our Part-Coordinated Transformer.
  • Figure 5: Qualitative comparison with existing methods. Green indicates that the motion is consistent with the text description. Red indicates that the motion described in the text is missing or wrong.
  • ...and 5 more figures
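
The contrast drawn in Figure 2 between independent part generators (b) and coordinated ones (c) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the summary of the other parts is a plain element-wise mean, `toy_update` is a hypothetical stand-in for a generator step, and none of this reproduces the paper's Part Coordination module, whose internals are not described in this section.

```python
def toy_update(state, context=None):
    """Hypothetical generator step: blend own state with the context, if any."""
    if context is None:
        return list(state)
    return [0.5 * s + 0.5 * c for s, c in zip(state, context)]


def step_independent(states, update=toy_update):
    """Figure 2(b) style: each part is updated from its own state only."""
    return [update(s, context=None) for s in states]


def step_coordinated(states, update=toy_update):
    """Figure 2(c) style: each part also receives a summary of the other parts.

    Assumption: the summary is the element-wise mean of the other parts' states.
    All parts read the states from before the step (a synchronous update).
    """
    new_states = []
    for i, state in enumerate(states):
        others = [o for j, o in enumerate(states) if j != i]
        context = [sum(vals) / len(others) for vals in zip(*others)]
        new_states.append(update(state, context=context))
    return new_states


if __name__ == "__main__":
    parts = [[1.0], [0.0], [0.0]]  # three toy parts with 1-D states
    print(step_independent(parts))
    print(step_coordinated(parts))
```

In the independent variant a change in one part never reaches the others, while in the coordinated variant the non-zero state of the first part shifts the other two after a single step.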