MotionChain: Conversational Motion Controllers via Multimodal Prompts
Biao Jiang, Xin Chen, Chi Zhang, Fukun Yin, Zhuoyuan Li, Gang Yu, Jiayuan Fan
TL;DR
MotionChain addresses the challenge of multi-turn, context-aware control of continuous human motion by combining a discrete motion vocabulary (via a VQ-VAE tokenizer), a vision tokenizer, and a vision-motion-aware language model. The method is trained in three stages (motion tokenizer pretraining, visual-to-language alignment, and instruction-tuned multi-turn diffusion of motion prompts), which allows text-, image-, and motion-conditioned generation within a unified vocabulary. It introduces a multi-turn motion conversation dataset and a motion-composition mechanism to generate temporally coherent sequences, achieving state-of-the-art or competitive results on motion reasoning and temporal composition while supporting intuitive, step-by-step task execution for embodied systems. The work targets humanoid robotics, game agents, and virtual avatars, where it enables natural, "chat-like" control of motion with multimodal inputs and long-term planning capabilities.
Abstract
Recent advancements in language models have demonstrated their adeptness in conducting multi-turn dialogues and retaining conversational context. However, this proficiency remains largely unexplored in other multimodal generative models, particularly in human motion models. By integrating multi-turn conversations in controlling continuous virtual human movements, generative human motion models can achieve an intuitive and step-by-step process of human task execution for humanoid robotics, game agents, or other embodied systems. In this work, we present MotionChain, a conversational human motion controller that generates continuous and long-term human motion through multimodal prompts. Specifically, MotionChain consists of multi-modal tokenizers that transform various data types, such as text, image, and motion, into discrete tokens, coupled with a Vision-Motion-aware Language model. By leveraging large-scale language, vision-language, and vision-motion data to assist motion-related generation tasks, MotionChain comprehends each instruction in a multi-turn conversation and generates human motions that follow these prompts. Extensive experiments validate the efficacy of MotionChain, demonstrating state-of-the-art performance in conversational motion generation, as well as more intuitive ways of controlling and interacting with virtual humans.
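
To make the idea of turning motion into discrete tokens concrete, the following is a minimal sketch of the quantization step of a VQ-style tokenizer: each motion feature vector is mapped to the index of its nearest codebook entry, and the indices are then offset so they occupy ids after the text vocabulary. This is an illustration under stated assumptions, not the paper's implementation. The learned encoder is omitted, and the function names (`quantize`, `to_unified_ids`), the toy codebook, and the text vocabulary size of 32000 are all hypothetical.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry.

    Nearest is measured by squared Euclidean distance, as in standard
    vector quantization. A real tokenizer would first pass raw motion
    through a learned encoder; that step is omitted here.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def to_unified_ids(motion_codes: Sequence[int], text_vocab_size: int) -> List[int]:
    """Shift motion code indices so they sit after the text token ids.

    This is one simple way to place motion tokens and text tokens in a
    single vocabulary (an assumption made for illustration).
    """
    return [text_vocab_size + code for code in motion_codes]


# Toy data: a 3-entry codebook and three 2-D "motion features" (hypothetical).
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]

codes = quantize(frames, codebook)
unified = to_unified_ids(codes, text_vocab_size=32000)
print(codes)    # [0, 1, 2]
print(unified)  # [32000, 32001, 32002]
```

Once motion is expressed as ids in the same range as text tokens, a language model can read and emit motion and text in a single sequence, which is what allows a multi-turn conversation to interleave instructions and generated motion.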
