CoMA: Compositional Human Motion Generation with Multi-modal Agents

Shanlin Sun, Gabriel De Araujo, Jiaqi Xu, Shenghan Zhou, Hanwen Zhang, Ziheng Huang, Chenyu You, Xiaohui Xie

TL;DR

CoMA tackles the scarcity of diverse motion data and the difficulty of describing complex motions by introducing a multi-modal, agent-based system that combines an LLM-driven Task Planner, a SPAM-based Motion Generator, a Trajectory Editor, and a Motion Reviewer for iterative self-correction. The core innovation, SPAM, employs four body-part-specific VQ-VAE codebooks plus a spatially-aware transformer to enable fine-grained, context-aware motion generation and editing, while trajectory handling and motion-captioning components support long-horizon sequences and self-verification. Evaluations on HumanML3D show competitive performance against state-of-the-art methods, with strong results on complex, context-rich prompts and high-quality motion captions, supported by user studies. The framework generates, edits, and describes detailed human motions, offering a path toward more capable instruction-following motion generation without training data beyond established datasets.

Abstract

3D human motion generation has seen substantial advancement in recent years. While state-of-the-art approaches have improved performance significantly, they still struggle with complex and detailed motions unseen in training data, largely due to the scarcity of motion datasets and the prohibitive cost of generating new training examples. To address these challenges, we introduce CoMA, an agent-based solution for complex human motion generation, editing, and comprehension. CoMA leverages multiple collaborative agents powered by large language and vision models, alongside a mask transformer-based motion generator featuring body part-specific encoders and codebooks for fine-grained control. Our framework enables generation of both short and long motion sequences with detailed instructions, text-guided motion editing, and self-correction for improved quality. Evaluations on the HumanML3D dataset demonstrate competitive performance against state-of-the-art methods. Additionally, we create a set of context-rich, compositional, and long text prompts, where user studies show our method significantly outperforms existing approaches.
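The plan, generate, edit, and review loop named in the TL;DR can be sketched as follows. This is only an illustrative outline under stated assumptions: the function names, the `Review` type, the `max_rounds` limit, and the way reviewer feedback is folded back into the prompt are all hypothetical and are not taken from the paper. The toy agents stand in for the LLM- and vision-model-powered components.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    """Hypothetical reviewer verdict: accept flag plus free-text feedback."""
    ok: bool
    feedback: str


def generate_with_self_correction(
    prompt: str,
    plan: Callable[[str], List[str]],
    generate: Callable[[List[str]], List[str]],
    edit_trajectory: Callable[[List[str]], List[str]],
    review: Callable[[str, List[str]], Review],
    max_rounds: int = 3,  # assumed retry budget, not from the paper
) -> List[str]:
    """Plan sub-prompts, generate motion, edit trajectory, then review and retry."""
    motion = edit_trajectory(generate(plan(prompt)))
    for _ in range(max_rounds):
        verdict = review(prompt, motion)
        if verdict.ok:
            break
        # Assumed feedback mechanism: re-plan with the reviewer's comment appended.
        revised = prompt + " | feedback: " + verdict.feedback
        motion = edit_trajectory(generate(plan(revised)))
    return motion


# Toy stand-ins for the agents (placeholders, not real models).
def toy_plan(prompt: str) -> List[str]:
    return [part.strip() for part in prompt.split(",")]


def toy_generate(sub_prompts: List[str]) -> List[str]:
    return [f"motion<{p}>" for p in sub_prompts]


def toy_edit(motion: List[str]) -> List[str]:
    return motion


def toy_review(prompt: str, motion: List[str]) -> Review:
    return Review(ok=True, feedback="")


result = generate_with_self_correction(
    "walk forward, wave left hand", toy_plan, toy_generate, toy_edit, toy_review
)
```

Passing the agents as callables keeps the control flow separate from whichever models implement each role, which is one plausible way to organize such a system.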

Paper Structure

This paper contains 54 sections, 6 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: CoMA can generate high-quality motion sequences despite challenging user expectations. Red labels indicate context-rich moves and/or poses, purple labels indicate spatially compositional motions, and gray labels indicate trajectory-editing instructions.
  • Figure 2: Illustrative architecture comparison between (a) text-conditional motion generation models (MDM, MLD, MoMask, MMM), (b) keypoint/trajectory-conditional motion editing models (GMD, DNO, PriorMDM, MotionFix, OmniControl), (c) motion-language autoregressive models (MotionGPT, MotionChain, MotionAgent), (e) LLM-grounded motion generation models (ReMoDiffuse, FineMoGen, CoMo), and (d) our CoMA framework.
  • Figure 3: A real example of how our CoMA workflow generates a context-rich, compositional, and long motion sequence given only a text prompt. More detailed explanations of this example are in the Appendix.
  • Figure 4: SPAM overview. (a) Motion sequence is decomposed into four body parts: left upper (LU), right upper (RU), left lower (LL), and right lower (RL). Each part is tokenized through separate RVQs and reconstructed into a whole-body motion through a shared decoder. (b) Base-layer motion tokens are randomly masked, while local/global text prompts are encoded separately and concatenated with corresponding motion tokens. The Masked SPAM Transformer is trained to predict the masked tokens. The residual transformer follows a similar architecture and is omitted for brevity.
  • Figure 5: Editing abilities of CoMA and MMM
  • ...and 6 more figures
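The random masking of base-layer tokens described in the Figure 4(b) caption can be illustrated with a minimal sketch. The caption does not say whether masks are drawn independently per body part or how the mask ratio is chosen, so both are assumptions here, as are the `MASK_ID` sentinel and the `mask_base_tokens` helper. Only the four part names (LU, RU, LL, RL) come from the caption.

```python
import random
from typing import Dict, List, Tuple

# Body-part names from the Figure 4 caption.
BODY_PARTS = ("LU", "RU", "LL", "RL")
MASK_ID = -1  # hypothetical sentinel for a masked token


def mask_base_tokens(
    tokens: Dict[str, List[int]],
    mask_ratio: float,
    rng: random.Random,
) -> Tuple[Dict[str, List[int]], Dict[str, List[bool]]]:
    """Randomly mask a fraction of base-layer tokens in each body-part stream.

    Assumption: each part is masked independently with the same ratio.
    Returns the masked token streams and boolean masks (True = masked).
    """
    masked: Dict[str, List[int]] = {}
    mask: Dict[str, List[bool]] = {}
    for part in BODY_PARTS:
        seq = tokens[part]
        n_mask = max(1, round(mask_ratio * len(seq)))
        positions = set(rng.sample(range(len(seq)), n_mask))
        mask[part] = [i in positions for i in range(len(seq))]
        masked[part] = [MASK_ID if i in positions else t for i, t in enumerate(seq)]
    return masked, mask


rng = random.Random(0)
tokens = {part: list(range(10, 18)) for part in BODY_PARTS}  # 8 toy token ids per part
masked, mask = mask_base_tokens(tokens, 0.5, rng)
```

In training, a transformer would be asked to predict the original ids at the positions where `mask` is True. The model itself is outside the scope of this sketch.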