M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation

Mingshuang Luo; Ruibing Hou; Zhuo Li; Hong Chang; Zimo Liu; Yaowei Wang; Shiguang Shan

TL;DR

M$^3$GPT is a framework that unifies motion, text, and music through discrete tokenizers and a shared language-model backbone. It is trained in three stages: tokenizer learning, modality alignment with joint LLM optimization, and instruction tuning. Using text as a bridging modality, the model learns connections among six motion-relevant tasks (text-to-motion, motion-to-text, music-to-dance, dance-to-music, motion prediction, motion in-between). Auxiliary tasks (music-to-text and text-to-dance) and a shared motion/dance tokenizer foster synergy between generation tasks, and reconstruction signals are backpropagated into the LLM to improve fidelity. Experiments on Motion-X, AIST++, and FineDance show competitive performance and strong zero-shot capabilities, including long-duration and text-conditioned dance generation. The work couples discrete representations with language-model reasoning for multimodal motion understanding and generation.

Abstract

This paper presents M$^3$GPT, an advanced $\textbf{M}$ultimodal, $\textbf{M}$ultitask framework for $\textbf{M}$otion comprehension and generation. M$^3$GPT operates on three fundamental principles. The first focuses on creating a unified representation space for various motion-relevant modalities. We employ discrete vector quantization for multimodal conditional signals, such as text, music and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second involves modeling motion generation directly in the raw motion space. This strategy circumvents the information loss associated with a discrete tokenizer, resulting in more detailed and comprehensive motion generation. Third, M$^3$GPT learns to model the connections and synergies among various motion-relevant tasks. Text, the most familiar and well-understood modality for LLMs, is utilized as a bridge to establish connections between different motion tasks, facilitating mutual reinforcement. To our knowledge, M$^3$GPT is the first model capable of comprehending and generating motions based on multiple signals. Extensive experiments highlight M$^3$GPT's superior performance across various motion-relevant tasks and its powerful zero-shot generalization capabilities for extremely challenging tasks. Project page: \url{https://github.com/luomingshuang/M3GPT}.
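The first principle above (discrete vector quantization feeding a single LLM vocabulary) can be illustrated with a minimal sketch. It is not the authors' implementation: the codebook sizes, the text vocabulary size, the offset layout, and the helper names `quantize` and `to_shared_vocab` are all hypothetical, chosen only to show how nearest-codebook indices from different modalities could be placed in disjoint ranges of one vocabulary.

```python
from typing import List, Sequence


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


# Hypothetical sizes, for illustration only (not taken from the paper).
TEXT_VOCAB_SIZE = 32000
MOTION_CODEBOOK_SIZE = 512

# Each modality gets its own contiguous id range after the text vocabulary.
MOTION_OFFSET = TEXT_VOCAB_SIZE
MUSIC_OFFSET = MOTION_OFFSET + MOTION_CODEBOOK_SIZE


def to_shared_vocab(indices: Sequence[int], offset: int) -> List[int]:
    """Shift modality-local codebook indices into the shared vocabulary."""
    return [offset + i for i in indices]


# Toy 2-D "motion features" and a 3-entry codebook.
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_features = [[0.9, 0.1], [0.1, 0.8], [0.05, 0.0]]

motion_ids = quantize(toy_features, toy_codebook)
shared_ids = to_shared_vocab(motion_ids, MOTION_OFFSET)
print(motion_ids, shared_ids)
```

With disjoint id ranges, a language model can consume text, motion, and music tokens in one sequence, and a generated id can be routed back to the right decoder by checking which range it falls in.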

Paper Structure (22 sections, 6 equations, 7 figures, 18 tables)

Introduction
Related Work
Method
Multimodal tokenizers
Language Model Backbone
Training Strategy
Inference of M$^3$GPT
Experiments
Experimental setup
Ablation Studies
Comparisons with State-of-the-arts
Evaluation on Zero-Shot Tasks
Conclusion
Text-Motion Alignment Model
Details for Training and Evaluating
...and 7 more sections

Figures (7)

Figure 1: M$^3$GPT can handle core motion comprehension and generation tasks, including text-to-motion, motion-to-text, music-to-dance, dance-to-music, motion prediction, and motion in-between. The motion sequences within the dashed-line areas are masked in the input.
Figure 2: An overview of the M$^3$GPT framework. M$^3$GPT consists of multimodal tokenizers and a motion-aware language model. The training process of M$^3$GPT consists of three stages: multimodal tokenizers training, modality-alignment pre-training, and instruction tuning.
Figure 3: Qualitative results for long-term dance and music-text conditioned dance generation of M$^3$GPT.
Figure 4: Pipeline of Text-Motion Alignment Model. The training of the text-motion alignment model includes two stages: pre-training motion auto-encoder and text-motion contrastive learning.
Figure 5: Tasks for M$^3$GPT pre-training and instruction tuning. Random represents the unconstrained generation of motion/text/music in the corresponding task.
...and 2 more figures
