MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls

Yuxuan Bian, Ailing Zeng, Xuan Ju, Xian Liu, Zhaoyang Zhang, Wei Liu, Qiang Xu

TL;DR

This paper proposes MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control and achieves state-of-the-art performance on various standard motion generation tasks. It also introduces MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format.

Abstract

Whole-body multimodal motion generation, controlled by text, speech, or music, has numerous applications including video generation and character animation. However, employing a unified model to achieve various generation tasks with different condition modalities presents two main challenges: motion distribution drifts across different tasks (e.g., co-speech gestures and text-driven daily actions) and the complex optimization of mixed conditions with varying granularities (e.g., text and audio). Additionally, inconsistent motion formats across different tasks and datasets hinder effective training toward multimodal motion generation. In this paper, we propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. Our framework employs a coarse-to-fine training strategy, starting with the first stage of text-to-motion semantic pre-training, followed by the second stage of multimodal low-level control adaptation to handle conditions of varying granularities. To effectively learn and transfer motion knowledge across different distributions, we design MC-Attn for parallel modeling of static and dynamic human topology graphs. To overcome the motion format inconsistency of existing benchmarks, we introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format. Extensive experiments show that MotionCraft achieves state-of-the-art performance on various standard motion generation tasks.

Paper Structure

This paper contains 40 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: We propose MotionCraft, a diffusion transformer that crafts whole-body motion with plug-and-play multimodal controls, encompassing robust motion generation abilities including Text-to-Motion, Speech-to-Gesture, and Music-to-Dance.
  • Figure 2: The t-SNE latent space of motion in different generation tasks. It illustrates the motion distribution drifts across different generation scenarios.
  • Figure 3: Architecture of MotionCraft. MotionCraft is a transformer-based diffusion model. In the first stage, MotionCraft uses text as a semantic control guide to learn coarse-grained cross-scenario motion knowledge across multiple datasets; in the second stage, MotionCraft freezes the backbone while adding a plug-and-play control branch to learn the different low-level control signals. The core of MotionCraft is MC-Attn, which optimizes the representation of motion token sequences by capturing the spatial properties of static and dynamic human topology graphs and learning temporal relationships in parallel.
  • Figure 4: The qualitative results of MotionCraft and other state-of-the-art baselines on three representative tasks, text-to-motion, speech-to-gesture, and music-to-dance. More detailed visualization comparisons are in our supplementary.
  • Figure 5: Multimodal video generation application with our generated motions conditioned on music (upper row) or speech (lower row). We project them to 2D images to serve as motion conditions for MimicMotion.
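
The two-stage schedule described in the Figure 3 caption (train the backbone on text first, then freeze it and train only an added control branch) can be illustrated with a toy sketch. This is not the MotionCraft implementation: the scalar weights, the additive way the control branch is combined with the backbone, the squared-error loss, and all names (`Param`, `predict`, `train_step`, `run_two_stage`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """A scalar weight plus a flag saying whether training may update it."""
    value: float
    trainable: bool = True


def predict(backbone: Param, control: Param, text_feat: float, ctrl_feat: float) -> float:
    # Toy stand-in for the model: backbone term plus an additive control-branch term.
    # The additive combination is an assumption, not taken from the paper.
    return backbone.value * text_feat + control.value * ctrl_feat


def train_step(backbone, control, text_feat, ctrl_feat, target, lr=0.1):
    """One gradient-descent step on squared error; frozen parameters are skipped."""
    err = predict(backbone, control, text_feat, ctrl_feat) - target
    for param, feat in ((backbone, text_feat), (control, ctrl_feat)):
        if param.trainable:
            param.value -= lr * 2 * err * feat  # d(err^2)/d(param) = 2 * err * feat
    return err ** 2


def run_two_stage(steps: int = 100):
    backbone = Param(0.0)
    control = Param(0.0, trainable=False)

    # Stage 1: text-only pre-training; the control branch receives no signal.
    for _ in range(steps):
        train_step(backbone, control, text_feat=1.0, ctrl_feat=0.0, target=2.0)
    backbone_after_stage1 = backbone.value

    # Stage 2: freeze the backbone, then train only the control branch.
    backbone.trainable = False
    control.trainable = True
    for _ in range(steps):
        train_step(backbone, control, text_feat=1.0, ctrl_feat=1.0, target=3.0)

    return backbone_after_stage1, backbone.value, control.value


if __name__ == "__main__":
    print(run_two_stage())
```

In this sketch the backbone weight is identical before and after stage 2, and the control branch alone absorbs the difference between the stage 1 and stage 2 targets. That is the sense in which a frozen backbone lets a control branch be attached without disturbing what was learned in pre-training.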