Large Motion Model for Unified Multi-Modal Motion Generation

Mingyuan Zhang, Daisheng Jin, Chenyang Gu, Fangzhou Hong, Zhongang Cai, Jingfang Huang, Chongzhi Zhang, Xinying Guo, Lei Yang, Ying He, Ziwei Liu

TL;DR

This work introduces the Large Motion Model (LMM), a generalist, multi-modal motion generation model that unifies diverse motion tasks under one framework. It builds on MotionVerse, a large-scale dataset with a unified motion representation based on TOMATO, and uses representation translators to map outputs to dataset-specific formats. LMM combines a transformer-based diffusion backbone with the ArtAttention mechanism and a two-stage regime of pre-training and fine-tuning to exploit heterogeneous data and multi-modal signals. Across nine benchmarks, LMM achieves competitive or state-of-the-art performance and generalizes well to unseen tasks. Ablations offer insights into training strategies and architectural choices, and the paper also discusses limitations and societal implications.

Abstract

Human motion generation, a cornerstone technique in animation and video production, has widespread applications in tasks such as text-to-motion and music-to-dance. Previous works focus on developing specialist models tailored to each task, which limits scalability. In this work, we present Large Motion Model (LMM), a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model. A unified motion model is appealing since it can leverage a wide range of motion data to achieve broad generalization beyond a single task. However, it is also challenging due to the heterogeneous nature of substantially different motion data and tasks. LMM tackles these challenges from three principled aspects: 1) Data: We consolidate datasets with different modalities, formats and tasks into a comprehensive yet unified motion generation dataset, MotionVerse, comprising 10 tasks, 16 datasets, a total of 320k sequences, and 100 million frames. 2) Architecture: We design an articulated attention mechanism, ArtAttention, that incorporates body part-aware modeling into a Diffusion Transformer backbone. 3) Pre-Training: We propose a novel pre-training strategy for LMM, which employs variable frame rates and masking forms, to better exploit knowledge from diverse training data. Extensive experiments demonstrate that our generalist LMM achieves performance competitive with state-of-the-art specialist models across various standard motion generation tasks. Notably, LMM exhibits strong generalization capabilities and emergent properties across many unseen tasks. Additionally, our ablation studies reveal valuable insights about training and scaling up large motion models for future research.
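
The pre-training idea mentioned above (variable frame rates plus masking) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the uniform stride sampling, the per-entry Bernoulli mask, and the zero fill value are all assumptions made for illustration, and only the count of 10 body parts is taken from the paper's description of its unified representation.

```python
import random

def random_downsample(frames, max_stride, rng):
    """Keep every k-th frame, with k drawn uniformly from 1..max_stride (assumed scheme)."""
    stride = rng.randint(1, max_stride)
    return frames[::stride], stride

def random_mask(num_frames, num_parts, mask_ratio, rng):
    """Build a num_frames x num_parts grid; True marks an entry hidden from the model."""
    return [[rng.random() < mask_ratio for _ in range(num_parts)]
            for _ in range(num_frames)]

def apply_mask(frames, mask, fill=0.0):
    """Replace hidden body-part features with a fill value (zero fill is an assumption)."""
    return [[fill if hidden else value for value, hidden in zip(frame, row)]
            for frame, row in zip(frames, mask)]

rng = random.Random(0)

# Toy sequence: 12 frames, 10 body parts, one scalar feature per part.
sequence = [[float(t * 10 + p) for p in range(10)] for t in range(12)]

# Simulate a lower frame rate, then hide a random subset of entries.
short, stride = random_downsample(sequence, max_stride=4, rng=rng)
mask = random_mask(len(short), 10, mask_ratio=0.3, rng=rng)
masked = apply_mask(short, mask)
```

In a setup like this, a model would be trained to reconstruct the hidden entries from the visible ones, so that sequences recorded at different frame rates and with partially missing parts can all contribute training signal.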

Paper Structure (30 sections, 6 equations, 6 figures, 12 tables)

This paper contains 30 sections, 6 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: We present Large Motion Model (LMM), the first generalist multi-modal motion generation model that can perform multiple motion generation tasks simultaneously and achieve competitive performance across nine widely used benchmarks.
  • Figure 2: MotionVerse. We preprocess distinct motion-centric datasets into a unified format. For motion sequences, we first convert them to the TOMATO (Lu et al., 2023) representation and then divide them into 10 independent body parts, which serve as our unified motion representation. To handle multi-modal condition signals, we employ ImageBind (Girdhar et al., 2023) to transform them into unified features across modalities.
  • Figure 3: Overall pipeline of LMM. Left: Our two-stage training procedure, including unsupervised pre-training and supervised fine-tuning. Random down-sampling and random mask strategies are applied to enhance knowledge absorption. Right: The generic inference process of LMM. The noised motion sequence and the given context are merged before being input into the network. LMM then synthesizes motion sequences consistent with the provided multi-modal condition signals.
  • Figure 4: Architecture of LMM. LMM is a transformer-based diffusion model. Dataset-dependent Read-In layers and Read-Out layers facilitate the conversion of the motion sequence between our intermediate representation and the latent feature space. In the stem of LMM, ArtAttention refines the feature representations through the spatial and temporal attention branches.
  • Figure 5: Visualization results of LMM-Large. Panels a)-d) show examples of text-driven motion generation. Panels e) and f) show synthesized motion sequences under both textual and musical constraints.
  • ...and 1 more figure
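
The Figure 4 caption describes ArtAttention as refining features through a spatial and a temporal attention branch. A rough sketch of that two-branch idea, over a grid of frames by body parts, might look as follows. It is a hypothetical illustration only: it uses plain scaled dot-product self-attention with identity query/key/value projections, and the residual sum used to fuse the branches is an assumption, not the paper's actual ArtAttention design.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * tok[d] for w, tok in zip(weights, tokens))
                    for d in range(dim)])
    return out

def spatial_branch(features):
    """Attend across body parts within each frame; features[t][p] is a vector."""
    return [self_attention(frame) for frame in features]

def temporal_branch(features):
    """Attend across frames separately for each body part."""
    num_frames, num_parts = len(features), len(features[0])
    out = [[None] * num_parts for _ in range(num_frames)]
    for p in range(num_parts):
        track = self_attention([features[t][p] for t in range(num_frames)])
        for t in range(num_frames):
            out[t][p] = track[t]
    return out

def two_branch_block(features):
    """Residual sum of the input and both branches (this fusion rule is an assumption)."""
    spatial = spatial_branch(features)
    temporal = temporal_branch(features)
    return [[[x + s + m for x, s, m in
              zip(features[t][p], spatial[t][p], temporal[t][p])]
             for p in range(len(features[0]))]
            for t in range(len(features))]

# Toy input: 4 frames, 10 body parts, 3-dimensional features per part.
features = [[[0.1 * t, 0.1 * p, 1.0] for p in range(10)] for t in range(4)]
refined = two_branch_block(features)
```

Factorizing attention this way keeps each attention call short (10 parts or T frames at a time) instead of attending over all frame-part pairs at once, which is one common motivation for separate spatial and temporal branches.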