MotionChain: Conversational Motion Controllers via Multimodal Prompts

Biao Jiang, Xin Chen, Chi Zhang, Fukun Yin, Zhuoyuan Li, Gang YU, Jiayuan Fan

TL;DR

MotionChain addresses multi-turn, context-aware control of continuous human motion by combining a discrete motion vocabulary (learned with a VQ-VAE tokenizer), a vision tokenizer, and a vision-motion-aware language model. The method is trained in three stages (motion tokenizer pretraining, visual-to-language alignment, and instruction tuning for multi-turn motion generation), enabling text-, image-, and motion-conditioned generation within a unified vocabulary. It introduces a multi-turn motion conversation dataset and a motion-composition mechanism for generating temporally coherent sequences, and it reports state-of-the-art or competitive results on motion reasoning and temporal composition while enabling intuitive, step-by-step task execution for embodied systems. The work targets humanoid robotics, game agents, and virtual avatars, enabling natural, "chat-like" control of motion with multimodal inputs and long-term planning capabilities.

Abstract

Recent advancements in language models have demonstrated their adeptness in conducting multi-turn dialogues and retaining conversational context. However, this proficiency remains largely unexplored in other multimodal generative models, particularly in human motion models. By integrating multi-turn conversations into the control of continuous virtual human movements, generative human motion models can achieve an intuitive, step-by-step process of human task execution for humanoid robotics, game agents, or other embodied systems. In this work, we present MotionChain, a conversational human motion controller that generates continuous and long-term human motion through multimodal prompts. Specifically, MotionChain consists of multimodal tokenizers that transform various data types, such as text, image, and motion, into discrete tokens, coupled with a Vision-Motion-aware Language model. By leveraging large-scale language, vision-language, and vision-motion data to assist motion-related generation tasks, MotionChain comprehends each instruction in a multi-turn conversation and generates human motions that follow these prompts. Extensive experiments validate the efficacy of MotionChain, demonstrating state-of-the-art performance in conversational motion generation, as well as more intuitive ways of controlling and interacting with virtual humans.
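
The tokenization idea in the abstract can be illustrated with a minimal sketch. It assumes that motion features are quantized by nearest-neighbor lookup in a learned codebook (the usual VQ-VAE step) and that motion indices are placed after the text ids to form one vocabulary. The toy codebook, the offset scheme, and the names `quantize` and `to_unified_ids` are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def to_unified_ids(motion_indices: Sequence[int], text_vocab_size: int) -> List[int]:
    """Shift motion indices past the text ids (assumed layout of the shared vocabulary)."""
    return [text_vocab_size + i for i in motion_indices]


# Toy data (hypothetical): a 3-entry codebook and three 2-D motion features.
TOY_CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
TOY_FEATURES = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]

motion_indices = quantize(TOY_FEATURES, TOY_CODEBOOK)
unified_ids = to_unified_ids(motion_indices, text_vocab_size=100)
print(motion_indices, unified_ids)
```

With a shared id space like this, a single language model can read and emit text and motion tokens in the same sequence, which is what allows motion to appear on either side of a conversation turn.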

Paper Structure (28 sections, 7 equations, 12 figures, 12 tables)

Figures (12)

  • Figure 1: MotionChain can interpret instructions from multi-turn conversations and generate human motions or textual answers based on text, motion, or image inputs. We provide the conversation results in image-conditioned motion generation (first column), motion reasoning (second column), motion editing (third column), and motion translation (fourth column), with each subsequent turn informed by all previous conversations. Left-to-right represents the temporal order.
  • Figure 2: Method overview: MotionChain consists of a motion tokenizer $\mathcal{V_M}$ (\ref{sec:method:tokenizer}), a vision tokenizer $\mathcal{V_I}$ (\ref{sec:method:tokenizer}), and a vision-motion-aware language model (\ref{sec:method:lm}). By combining motion tokens generated by $\mathcal{V_M}$, visual language token embeddings projected by the vision tokenizer $\mathcal{V_I}$, and text tokens from the text tokenizer, MotionChain achieves a unified learning paradigm for both motion and linguistic data.
  • Figure 3: Data collection overview: To collect motion reasoning data, we start from human motion captions in an existing text-motion dataset. The text-motion retrieval model TMR [petrovich2023tmr] then groups motion pairs into categories according to their similarity. With the assistance of ChatGPT, we craft motion editing task data corresponding to these similarity levels. Combining the single-turn motion reasoning and editing tasks with the 14 tasks defined in [jiang2023motiongpt], we construct a rich multimodal, multi-turn conversation dataset.
  • Figure 4: Motion Composition Variants: We illustrate the baselines for motion composition during multi-turn motion generation: (a) independent decoding of each turn; (b) separate decoding conditioned on the last few tokens from the prior turn; (c) decoding with joint motion tokens. Green tokens stand for the image condition, blue tokens for textual instructions, and orange tokens for human motions.
  • Figure 5: The gallery showcases the results of our MotionChain model. The supervision of MotionChain is based on our conversational motion-language dataset (see Appendix \ref{sec:appendix:dataset}), which builds upon previous motion datasets [Guo_2022_CVPR_humanml3d, BABEL:CVPR:2021]. For a more dynamic visualization, we recommend referring to our supplemental video.
  • ...and 7 more figures
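
The three composition variants named in the Figure 4 caption differ in how much prior context the decoder sees at each turn. The sketch below is one reading of that caption, assuming tokens are plain strings and that a turn is a pair of instruction tokens and generated motion tokens. The function `build_context`, the variant names, and the `tail` parameter are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# One turn of the conversation: (instruction tokens, generated motion tokens).
Turn = Tuple[List[str], List[str]]


def build_context(history: List[Turn], instruction: List[str],
                  variant: str, tail: int = 2) -> List[str]:
    """Assemble the decoder context for the next turn under one variant."""
    if variant == "independent":
        # (a) Each turn is decoded on its own: no prior tokens are visible.
        return list(instruction)
    if variant == "tail":
        # (b) Condition on the last `tail` motion tokens of the prior turn.
        prev_motion = history[-1][1] if history else []
        start = max(0, len(prev_motion) - tail)
        return prev_motion[start:] + list(instruction)
    if variant == "joint":
        # (c) Keep all earlier instructions and motion tokens in one sequence.
        context: List[str] = []
        for past_instruction, past_motion in history:
            context += past_instruction + past_motion
        return context + list(instruction)
    raise ValueError(f"unknown variant: {variant}")


# Toy conversation (hypothetical tokens).
history = [(["walk"], ["m1", "m2", "m3"])]
next_instruction = ["then", "sit"]

ctx_independent = build_context(history, next_instruction, "independent")
ctx_tail = build_context(history, next_instruction, "tail", tail=2)
ctx_joint = build_context(history, next_instruction, "joint")
print(ctx_independent, ctx_tail, ctx_joint)
```

Under this reading, variant (a) cannot guarantee a smooth transition between turns, variant (b) sees only the end of the previous motion, and variant (c) exposes the full conversation at the cost of a longer sequence.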