M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation

Mingshuang Luo, Ruibing Hou, Zhuo Li, Hong Chang, Zimo Liu, Yaowei Wang, Shiguang Shan

TL;DR

M$^3$GPT is a unified framework that brings motion, text, and music together through discrete tokenizers and a shared language-model backbone. It is trained in three stages (tokenizer learning, modality alignment with joint LLM optimization, and instruction tuning) and uses text as a bridging modality to learn connections among six motion-relevant tasks: text-to-motion, motion-to-text, music-to-dance, dance-to-music, motion prediction, and motion in-between. Two auxiliary tasks (music-to-text and text-to-dance) and a shared motion/dance tokenizer foster synergy between the generation tasks, and reconstruction signals are backpropagated into the LLM to improve fidelity. Experiments on Motion-X, AIST++, and FineDance show competitive performance and strong zero-shot capabilities, including long-duration and text-conditioned dance generation. The work advances multimodal motion understanding and generation by tightly coupling discrete representations with language-model reasoning.

Abstract

This paper presents M$^3$GPT, an advanced $\textbf{M}$ultimodal, $\textbf{M}$ultitask framework for $\textbf{M}$otion comprehension and generation. M$^3$GPT operates on three fundamental principles. The first focuses on creating a unified representation space for various motion-relevant modalities. We employ discrete vector quantization for multimodal conditional signals, such as text, music and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second involves modeling motion generation directly in the raw motion space. This strategy circumvents the information loss associated with a discrete tokenizer, resulting in more detailed and comprehensive motion generation. Third, M$^3$GPT learns to model the connections and synergies among various motion-relevant tasks. Text, the most familiar and well-understood modality for LLMs, is utilized as a bridge to establish connections between different motion tasks, facilitating mutual reinforcement. To our knowledge, M$^3$GPT is the first model capable of comprehending and generating motions based on multiple signals. Extensive experiments highlight M$^3$GPT's superior performance across various motion-relevant tasks and its powerful zero-shot generalization capabilities for extremely challenging tasks. Project page: \url{https://github.com/luomingshuang/M3GPT}.
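
The first principle above (discrete vector quantization feeding a single vocabulary) can be illustrated with a minimal sketch. This is not the authors' implementation: the codebooks, the feature vectors, the base vocabulary size `TEXT_VOCAB_SIZE`, and the helper names `quantize` and `to_shared_vocab` are all hypothetical, chosen only to show how nearest-codebook lookup plus a per-modality offset could place motion and music tokens in one shared id space.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance). This is the lookup step of vector quantization."""
    ids = []
    for frame in frames:
        dists = [sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook]
        ids.append(dists.index(min(dists)))
    return ids


def to_shared_vocab(ids: Sequence[int], offset: int) -> List[int]:
    """Shift modality-local code indices into a disjoint range of the shared vocabulary."""
    return [offset + i for i in ids]


# Hypothetical sizes and toy codebooks (assumptions, not values from the paper).
TEXT_VOCAB_SIZE = 32000
motion_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
music_codebook = [[0.0, 1.0], [1.0, 0.0]]

# Each modality gets its own id range appended after the text vocabulary.
MOTION_OFFSET = TEXT_VOCAB_SIZE
MUSIC_OFFSET = MOTION_OFFSET + len(motion_codebook)

# Toy per-frame features standing in for encoder outputs.
motion_frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
music_frames = [[0.2, 0.9], [0.8, 0.1]]

motion_tokens = to_shared_vocab(quantize(motion_frames, motion_codebook), MOTION_OFFSET)
music_tokens = to_shared_vocab(quantize(music_frames, music_codebook), MUSIC_OFFSET)

print(motion_tokens)  # [32000, 32001, 32002]
print(music_tokens)   # [32003, 32004]
```

Keeping the id ranges disjoint means a single embedding table and output layer can cover every modality, which is one plausible reading of "a single vocabulary" in the abstract.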

Paper Structure

This paper contains 22 sections, 6 equations, 7 figures, 18 tables.

Figures (7)

  • Figure 1: M$^3$GPT can handle core motion comprehension and generation tasks, including text-to-motion, motion-to-text, music-to-dance, dance-to-music, motion prediction, and motion in-between. The motion sequences within the dashed-line areas are masked in the input.
  • Figure 2: An overview of the M$^3$GPT framework. M$^3$GPT consists of multimodal tokenizers and a motion-aware language model. The training process of M$^3$GPT consists of three stages: multimodal tokenizers training, modality-alignment pre-training, and instruction tuning.
  • Figure 3: Qualitative results for long-term dance and music-text conditioned dance generation of M$^3$GPT.
  • Figure 4: Pipeline of Text-Motion Alignment Model. The training of the text-motion alignment model includes two stages: pre-training motion auto-encoder and text-motion contrastive learning.
  • Figure 5: Tasks for M$^3$GPT pre-training and instruction tuning. Random represents the unconstrained generation of motion/text/music in the corresponding task.
  • ...and 2 more figures