Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

Qi Wu, Yubo Zhao, Yifan Wang, Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang

TL;DR

Motion-Agent introduces a training-efficient, LLM-driven framework for general human motion generation, editing, and understanding. By coupling a lightweight MotionLLM (adapter-tuned) with a fixed motion tokenizer/detokenizer and using GPT-4 as a conversation orchestrator, the approach enables long, multi-turn motion generation and bidirectional text-motion translation without task-specific pretraining. Key contributions include a VQ-VAE–based motion tokenizer, an expanded LLM vocabulary for motion tokens, and an adapter-based translation agent that achieves competitive text-to-motion results and state-of-the-art motion captioning. The method demonstrates strong multi-turn capabilities, smooth motion composition, and broad task versatility, with limitations noted in environment interaction and hand/face detail, pointing to future extensions. Overall, Motion-Agent offers a flexible, scalable pathway to integrate motion-language understanding into interactive, conversational systems.

Abstract

While previous approaches to 3D human motion generation have achieved notable success, they often rely on extensive training and are limited to specific tasks. To address these challenges, we introduce Motion-Agent, an efficient conversational framework designed for general human motion generation, editing, and understanding. Motion-Agent employs an open-source pre-trained language model to develop a generative agent, MotionLLM, that bridges the gap between motion and text. This is accomplished by encoding and quantizing motions into discrete tokens that align with the language model's vocabulary. With only 1-3% of the model's parameters fine-tuned using adapters, MotionLLM delivers performance on par with diffusion models and other transformer-based methods trained from scratch. By integrating MotionLLM with GPT-4 without additional training, Motion-Agent is able to generate highly complex motion sequences through multi-turn conversations, a capability that previous models have struggled to achieve. Motion-Agent supports a wide range of motion-language tasks, offering versatile capabilities for generating and customizing human motion through interactive conversational exchanges. Project page: https://knoxzhao.github.io/Motion-Agent
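
The abstract's step of "encoding and quantizing motions into discrete tokens that align with the language model's vocabulary" can be illustrated with a toy nearest-codebook lookup. This is a minimal sketch, not the authors' implementation: the 2-D codebook, the latent vectors, the `<motion_k>` token spelling, and the helper names `nearest_code`, `motion_to_tokens`, and `tokens_to_latents` are all assumptions made for illustration. A real tokenizer would learn the encoder and codebook jointly (for example with a VQ-VAE) and operate on much higher-dimensional features.

```python
from typing import List, Sequence


def nearest_code(latent: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `latent` (squared L2 distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(latent, codebook[k]))


def motion_to_tokens(latents: Sequence[Sequence[float]],
                     codebook: Sequence[Sequence[float]]) -> List[str]:
    """Quantize each latent frame and spell it as a hypothetical vocabulary token."""
    return [f"<motion_{nearest_code(z, codebook)}>" for z in latents]


def tokens_to_latents(tokens: Sequence[str],
                      codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Inverse lookup: map tokens back to codebook vectors (input to a decoder)."""
    ids = [int(t[len("<motion_"):-1]) for t in tokens]
    return [list(codebook[i]) for i in ids]


# Toy data (assumed): a 4-entry codebook and three encoded motion frames.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.9]]

tokens = motion_to_tokens(latents, codebook)
recovered = tokens_to_latents(tokens, codebook)
print(tokens)
```

Because each motion frame becomes one discrete symbol, the language model can treat motion like extra words once those symbols are added to its vocabulary, which is what makes adapter-only fine-tuning plausible in this setting.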

Paper Structure (28 sections, 5 equations, 11 figures, 7 tables)

Figures (11)

  • Figure 1: Multi-turn Conversation Between User and Motion-Agent. First Turn: Motion Understanding; Second Turn: Motion Generation; Third Turn: Motion Understanding with Previously Generated Motion; Fourth Turn: Motion Editing; Fifth Turn: Continue Motion Generation; Last Turn: Motion Editing on Long Sequence. Note that all turns are continuous.
  • Figure 2: Motion-Agent pipeline. GPT-4 can interact with the translation agent (i.e., MotionLLM) to generate or interpret motions based on input requirements. The generated motion tokens are concatenated and decoded, and the textual caption produced by MotionLLM is returned and processed by GPT-4.
  • Figure 3: Motion-Agent can comprehend abstract, complex user prompts and generate accurate, long motions. It also understands and answers user questions based on real-world knowledge. Notably, the three turns in this figure stem from a continuous conversation, demonstrating the flexibility of its multi-turn capability in scenarios that should not be influenced by previous turns.
  • Figure 4: Comparison with Other Methods. Our Motion-Agent accurately generates motions involving a series of actions, while other models struggle with more complex descriptions like this, resulting in short and unclear motions.
  • Figure 5: Motion-Agent can compose motions with smooth transitions. In this example, the two motions "a person falls down on the back" and "a person is walking" are provided to Motion-Agent in two turns. The system then generates a "stand up" motion to facilitate a seamless composition of the two motions.
  • ...and 6 more figures
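
The captions of Figures 2 and 5 describe motion tokens from several MotionLLM calls being concatenated before decoding, with a bridging "stand up" motion inserted between two requested motions. The sketch below shows only that concatenation step under heavy assumptions: the plan is hard-coded here rather than produced by GPT-4, `fake_motion_llm` is a stand-in that returns made-up token ids, and `run_plan` is a hypothetical helper name. The prompt "a person stands up" is an illustrative wording of the bridging motion.

```python
from typing import Callable, Dict, List


def run_plan(plan: List[str], generate: Callable[[str], List[int]]) -> List[int]:
    """Call the translation agent once per sub-prompt and concatenate the token ids."""
    sequence: List[int] = []
    for prompt in plan:
        sequence.extend(generate(prompt))
    return sequence


# Made-up outputs standing in for a real text-to-motion model (assumption).
FAKE_OUTPUTS: Dict[str, List[int]] = {
    "a person falls down on the back": [7, 7, 2],
    "a person stands up": [5, 9],
    "a person is walking": [1, 4, 1, 4],
}


def fake_motion_llm(prompt: str) -> List[int]:
    """Stub generator: look up canned motion token ids for a prompt."""
    return list(FAKE_OUTPUTS[prompt])


# In the described pipeline the orchestrating LLM would decide this plan;
# here it is written by hand, including the bridging motion.
plan = [
    "a person falls down on the back",
    "a person stands up",
    "a person is walking",
]

token_ids = run_plan(plan, fake_motion_llm)
print(token_ids)
```

The combined id sequence would then be passed once through the motion detokenizer, so the decoder sees the whole sequence rather than three separately decoded clips.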