Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs
Qi Wu, Yubo Zhao, Yifan Wang, Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang
TL;DR
Motion-Agent introduces a training-efficient, LLM-driven framework for general human motion generation, editing, and understanding. By coupling a lightweight MotionLLM (adapter-tuned) with a fixed motion tokenizer/detokenizer and using GPT-4 as a conversation orchestrator, the approach enables long, multi-turn motion generation and bidirectional text-motion translation without task-specific pretraining. Key contributions include a VQ-VAE–based motion tokenizer, an expanded LLM vocabulary for motion tokens, and an adapter-based translation agent that achieves competitive text-to-motion results and state-of-the-art motion captioning. The method demonstrates strong multi-turn capabilities, smooth motion composition, and broad task versatility, with limitations noted in environment interaction and hand/face detail, pointing to future extensions. Overall, Motion-Agent offers a flexible, scalable pathway to integrate motion-language understanding into interactive, conversational systems.
Abstract
While previous approaches to 3D human motion generation have achieved notable success, they often rely on extensive training and are limited to specific tasks. To address these challenges, we introduce Motion-Agent, an efficient conversational framework designed for general human motion generation, editing, and understanding. Motion-Agent employs an open-source pre-trained language model to develop a generative agent, MotionLLM, that bridges the gap between motion and text. This is accomplished by encoding and quantizing motions into discrete tokens that align with the language model's vocabulary. With only 1-3% of the model's parameters fine-tuned using adapters, MotionLLM delivers performance on par with diffusion models and other transformer-based methods trained from scratch. By integrating MotionLLM with GPT-4 without additional training, Motion-Agent is able to generate highly complex motion sequences through multi-turn conversations, a capability that previous models have struggled to achieve. Motion-Agent supports a wide range of motion-language tasks, offering versatile capabilities for generating and customizing human motion through interactive conversational exchanges. Project page: https://knoxzhao.github.io/Motion-Agent
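The idea of quantizing motion into discrete tokens that share the language model's vocabulary can be illustrated with a toy sketch. This is not the authors' implementation. The codebook values, the `<motion_k>` token naming, and the helper functions `quantize`, `extend_vocab`, and `motion_to_token_ids` are all assumptions made for illustration. In a real system the latents would come from a trained VQ-VAE encoder and the vocabulary from the LLM's tokenizer.

```python
from typing import Dict, List, Sequence


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
            for z in latents]


def extend_vocab(vocab: Dict[str, int], codebook_size: int) -> Dict[str, int]:
    """Return a copy of the vocabulary with one new token per codebook entry."""
    extended = dict(vocab)
    for k in range(codebook_size):
        # Hypothetical token naming; the paper's actual token strings may differ.
        extended[f"<motion_{k}>"] = len(extended)
    return extended


def motion_to_token_ids(codes: Sequence[int], vocab: Dict[str, int]) -> List[int]:
    """Translate codebook indices into ids in the extended vocabulary."""
    return [vocab[f"<motion_{k}>"] for k in codes]


# Toy data: a 3-entry codebook and three 2-D "motion latents" (assumed values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [[0.9, 0.1], [0.1, 0.8], [0.05, -0.02]]

codes = quantize(latents, codebook)                   # codebook indices per frame
vocab = extend_vocab({"walk": 0, "jump": 1}, len(codebook))
token_ids = motion_to_token_ids(codes, vocab)         # ids the LLM would see

print(codes, token_ids)
```

Once motion frames are expressed as ids in the same vocabulary as text, text-to-motion and motion-to-text can both be framed as ordinary sequence generation, which is what makes adapter-only fine-tuning plausible.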
