MCM: Multi-condition Motion Synthesis Framework

Zeyu Ling, Bo Han, Yongkang Wong, Han Lin, Mohan Kankanhalli, Weidong Geng

TL;DR

MCM is a multi-condition human motion synthesis (HMS) framework built on a dual-branch structure, a main branch plus a control branch, that extends a diffusion model originally conditioned only on text to audio conditions.

Abstract

Conditional human motion synthesis (HMS) aims to generate human motion sequences that conform to specific conditions. Text and audio are the two predominant modalities used as HMS control conditions. While existing research has primarily focused on single conditions, multi-condition human motion synthesis remains underexplored. In this study, we propose a multi-condition HMS framework, termed MCM, based on a dual-branch structure composed of a main branch and a control branch. This framework extends a diffusion model originally conditioned only on text to audio conditions. The extension covers both music-to-dance and co-speech HMS while preserving the motion quality and semantic association capabilities of the original model. Furthermore, we propose a Transformer-based diffusion model, designated MWNet, as the main branch. This model captures the spatial intricacies and inter-joint correlations of motion sequences through multi-wise self-attention modules. Extensive experiments show that our method achieves competitive results in single-condition and multi-condition HMS tasks.

Paper Structure

This paper contains 20 sections, 5 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Samples of Multi-Condition Motion synthesis (MCM). MCM can generate human motion across various scenarios: text-to-motion (blue), dance (red), and co-speech (yellow) motion synthesis under multiple conditions. The motion generated by MCM not only conforms to the rhythm of the music and speech but also exhibits consistency with the textual descriptions of dance and gestures.
  • Figure 2: Overview of the simplified two-layer MCM framework. MCM employs a dual-branch structure consisting of the main branch and the control branch. The layer-wise outputs from the control branch are connected to the main branch via bridge modules, which are fully connected layers or 1D convolutions with parameters initialized to zero. The output of each bridge module is added directly to the input feature vector of the corresponding layer in the main branch. The condition encoders comprise several pre-trained feature extractors for the different condition modalities.
  • Figure 3: Model architecture for a multi-wise attention block. It incorporates three distinct types of attention modules, which are employed alternately. The symbols "+" and "×" represent feature addition and multiplication, respectively. $T$ denotes the length of the input sequence, while $C^g$ and $C^h$ denote the number of channels of the matrices $Q$, $K$, and $V$ after the split. The split operation divides the channels into $g$ groups or $h$ heads. Context is the text condition for cross-attention and is exactly equal to $X$ for time-wise self-attention.
  • Figure 4: Text-sound multi-condition motion synthesis with MCM (MWNet as the main branch). Each sample is obtained using a segment of text and a segment of audio as inputs.
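
The Figure 2 caption describes bridge modules whose parameters are initialized to zero and whose output is added to the input of the corresponding main-branch layer. The following is a minimal sketch of that idea in plain Python, assuming a fully connected bridge and a single feature vector per layer. All function and variable names (linear, make_linear, main_branch_input, and so on) are hypothetical and are not taken from the authors' implementation. It illustrates one consequence of zero initialization: before the bridge has been trained, the main branch receives exactly its original input.

```python
# Hypothetical sketch of a zero-initialized bridge module.
# Names and shapes are illustrative assumptions, not the MCM code base.
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector x (plain Python lists)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def make_linear(dim_in, dim_out, zero_init=False, rng=None):
    """Return (weight, bias); every entry is 0.0 when zero_init is True."""
    if zero_init:
        return [[0.0] * dim_in for _ in range(dim_out)], [0.0] * dim_out
    rng = rng or random.Random(0)
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(dim_in)] for _ in range(dim_out)]
    bias = [rng.uniform(-0.5, 0.5) for _ in range(dim_out)]
    return weight, bias


def main_branch_input(main_feature, control_feature, bridge):
    """Add the bridge output to the feature entering a main-branch layer."""
    weight, bias = bridge
    bridge_out = linear(control_feature, weight, bias)
    return [m + b for m, b in zip(main_feature, bridge_out)]


rng = random.Random(42)
dim = 4
# Feature entering one main-branch layer (assumed shape: a single vector).
main_feature = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
# Output of the matching control-branch layer (same assumed shape).
control_feature = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

# Zero-initialized bridge: the control branch contributes nothing yet.
zero_bridge = make_linear(dim, dim, zero_init=True)
fused_at_init = main_branch_input(main_feature, control_feature, zero_bridge)
unchanged_at_init = fused_at_init == main_feature

# A bridge with non-zero parameters (standing in for a trained one)
# does alter the main-branch input.
nonzero_bridge = make_linear(dim, dim, rng=rng)
fused_nonzero = main_branch_input(main_feature, control_feature, nonzero_bridge)
changed_when_nonzero = fused_nonzero != main_feature
```

In this sketch the control signal can only influence the main branch once the bridge parameters move away from zero. A 1D-convolution bridge, the other option named in the caption, would behave the same way at initialization.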