MCM: Multi-condition Motion Synthesis Framework for Multi-scenario

Zeyu Ling, Bo Han, Yongkang Wong, Mohan Kankanhalli, Weidong Geng

TL;DR

This work tackles multi-condition, multi-scenario human motion synthesis by introducing MCM, a two-branch diffusion framework that seamlessly injects multiple conditioning modalities into DDPM-based models without reconfiguring the base architecture. A Transformer-based MWNet with channel-wise self-attention captures spatial inter-joint relationships, and bridge modules enable effective fusion of control signals from the auxiliary branch. Across text-to-motion and music-to-dance tasks, MCM achieves state-of-the-art results in text-driven generation and competitive performance in music-driven generation, while enabling fine-grained multi-condition control such as speech-to-gesture. The approach reduces data curation burdens by allowing cross-domain conditioning without task-specific reengineering, offering practical impact for scalable, flexible motion synthesis across entertainment, simulation, and robotics.

Abstract

The multi-condition human motion synthesis task aims to incorporate diverse conditional inputs such as text, music, and speech, which allows it to adapt across multiple scenarios, including text-to-motion and music-to-dance. Existing research has primarily focused on single conditions, and multi-condition human motion generation remains underexplored. In this paper, we address this gap by introducing MCM, a novel paradigm for motion synthesis that spans multiple scenarios under diverse conditions. The MCM framework can be integrated with any DDPM-like diffusion model to accommodate multi-conditional input while preserving its generative capabilities. Specifically, MCM employs a two-branch architecture consisting of a main branch and a control branch. The control branch shares the same structure as the main branch and is initialized with the main branch's parameters, which maintains the generation ability of the main branch while supporting multi-condition input. We also introduce MWNet, a Transformer-based, DDPM-like diffusion model, as our main branch; it captures the spatial complexity and inter-joint correlations in motion sequences through a channel-dimension self-attention module. Quantitative comparisons demonstrate that our approach achieves SoTA results in text-to-motion and competitive results in music-to-dance, comparable to task-specific methods. Furthermore, the qualitative evaluation shows that MCM not only streamlines the adaptation of methods originally designed for text-to-motion to domains like music-to-dance and speech-to-gesture, eliminating the need for extensive network reconfiguration, but also enables effective multi-condition modal control, realizing "once trained is motion need".
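
The two-branch design can be illustrated with a minimal sketch. It assumes toy `tanh` linear layers in place of the real Transformer blocks, and every name in it (`Layer`, `main_forward`, `mcm_forward`, `bridges`) is hypothetical rather than taken from the authors' code. How the condition enters the control branch is also an assumption made only for illustration. The sketch shows one property implied by the description: when the control branch is a copy of the main branch and the bridge modules are zero-initialized (as stated in the Figure 2 caption), the combined model initially reproduces the main branch's output exactly.

```python
import copy
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [u + v for u, v in zip(a, b)]

class Layer:
    """Toy stand-in for one block of a branch: h -> tanh(W h)."""
    def __init__(self, dim, rng):
        self.W = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, h):
        return [math.tanh(v) for v in matvec(self.W, h)]

def main_forward(layers, x):
    """Main branch alone, without any control signal."""
    h = x
    for layer in layers:
        h = layer(h)
    return h

def mcm_forward(main_layers, control_layers, bridges, x, cond):
    """Main branch plus control branch joined by bridge matrices.

    The output of control layer i passes through bridge i and is added
    to the input of main layer i.
    """
    h_main = x
    # Assumption: the condition is simply added to the control-branch input.
    h_ctrl = add(x, cond)
    for layer, ctrl_layer, bridge in zip(main_layers, control_layers, bridges):
        h_ctrl = ctrl_layer(h_ctrl)
        h_main = layer(add(h_main, matvec(bridge, h_ctrl)))
    return h_main

rng = random.Random(0)
dim, depth = 4, 3
main_layers = [Layer(dim, rng) for _ in range(depth)]
control_layers = copy.deepcopy(main_layers)  # initialized from the main branch
bridges = [[[0.0] * dim for _ in range(dim)] for _ in range(depth)]  # zero init

x = [0.5, -0.2, 0.1, 0.9]
cond = [1.0, 1.0, -1.0, 0.3]
baseline = main_forward(main_layers, x)
combined = mcm_forward(main_layers, control_layers, bridges, x, cond)
```

With zero bridges, `combined` equals `baseline`, so adding the control branch does not disturb the pretrained generator at the start of training. The condition only begins to influence the output once the bridge weights move away from zero.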

Paper Structure (20 sections, 6 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: Our MCM method generates human motion across various scenarios (e.g., text-to-motion or music-to-dance) based on different conditions (e.g., text, music, speech). Given challenging textual descriptions of actions such as kicking a ball, performing forward somersaults, and crawling, it produces highly realistic motion sequences. MCM can generate motion sequences that not only align with the rhythm but also match the dance descriptions (a musical note symbol marks these scenes). Additionally, MCM can generate co-speech motions based on speech audio and textual descriptions (marked with a microphone symbol).
  • Figure 2: MCM framework overview. MCM employs a dual-branch structure consisting of the main branch and the control branch. The layer-wise outputs from the control branch are connected to the main branch via bridge modules, which are fully connected layers or 1D convolutions with parameters initialized to zero. The output of each bridge module is directly added to the input feature vector of the corresponding layer in the main branch. The condition encoders comprise several pre-trained feature extractors for different modal conditions. The fully connected layer "in" maps the motion vector to the hidden vector, while the "out" layer performs the opposite mapping.
  • Figure 3: Model architecture of a multi-wise attention block. It uses three types of attention modules alternately. The symbols "+" and "×" represent feature addition and multiplication, respectively. $T$ denotes the length of the input sequence, while $C^g$ and $C^h$ denote the number of channels of the matrices $Q$, $K$, and $V$ after the split. The split operation divides the channels into $g$ groups or $h$ heads. Context represents the text condition for cross-attention and is exactly equal to $X$ for time-wise self-attention.
  • Figure 4: Dance genre control with different text prompts. From top to bottom, using the same piece of music, we input text descriptions "A dancer performs Break", "Waack", and "Lock" in addition to music.
  • Figure 5: Dance details control with different text prompts
  • ...and 1 more figure
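
The channel-dimension self-attention mentioned in the abstract and in the Figure 3 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the MWNet implementation: it uses $Q = K = V = X$ (no learned projections), a single group instead of $g$ groups, and a $\sqrt{T}$ scaling factor chosen by analogy with standard scaled dot-product attention. The function name `channel_wise_self_attention` is hypothetical. The point it shows is that the attention map has shape $C \times C$ (channel to channel) rather than $T \times T$ (frame to frame), so each output channel is a weighted mix of all input channels.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(M):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Plain matrix product of A (n x k) and B (k x m)."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def channel_wise_self_attention(X):
    """Attend across channels of X, which has shape (T, C).

    Assumptions for illustration: Q = K = V = X, one group,
    scores scaled by sqrt(T). Returns a matrix of shape (T, C).
    """
    T = len(X)
    Xt = transpose(X)        # (C, T): one row per channel
    scores = matmul(Xt, X)   # (C, C): channel-to-channel similarity
    attn = [softmax([s / math.sqrt(T) for s in row]) for row in scores]
    out_t = matmul(attn, Xt)  # (C, T): each channel mixes all channels
    return transpose(out_t)   # back to (T, C)

# Toy motion features: T = 4 frames, C = 3 channels.
X = [
    [1.0, 0.0, 2.0],
    [0.5, 1.0, -1.0],
    [0.0, 2.0, 0.5],
    [1.5, -0.5, 0.0],
]
Y = channel_wise_self_attention(X)
```

Because each attention row sums to 1, every value in frame `t` of `Y` is a convex combination of the channel values in frame `t` of `X`. In a time-wise self-attention module the roles are swapped: the attention map would be $T \times T$ and each frame would mix information from other frames.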