MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks

Yiming Wu, Wei Ji, Kecheng Zheng, Zicheng Wang, Dong Xu

TL;DR

MoTe tackles unified motion-text generation by learning the marginal, conditional, and joint distributions of motion and text in a single diffusion framework. A Motion Encoder-Decoder (MED) and a Text Encoder-Decoder (TED) embed motions and text into latent spaces, and a Motion-Text Diffusion Model (MTDM) performs diffusion across those latents. A configurable interaction module lets the same model perform text-to-motion, motion-to-text, or joint generation. The key contributions are an in-depth analysis of interaction modules, empirical evidence of a trade-off between text-to-motion and motion-to-text performance, and state-of-the-art text-to-motion results on HumanML3D together with competitive motion-to-text performance. The framework offers a versatile baseline for multi-modal motion generation and a starting point for broader multi-modal diffusion systems.

Abstract

Recent generative models, such as denoising diffusion models and large language models, have driven substantial progress in human motion analysis. However, existing approaches mainly focus on generating motions from textual descriptions and overlook the reciprocal task. In this paper, we present MoTe, a unified multi-modal model that handles diverse tasks by learning the marginal, conditional, and joint distributions of motion and text simultaneously. MoTe handles paired text-motion generation, motion captioning, and text-driven motion generation by simply modifying the input context. Specifically, MoTe is composed of three components: a Motion Encoder-Decoder (MED), a Text Encoder-Decoder (TED), and a Motion-Text Diffusion Model (MTDM). MED and TED are trained to extract latent embeddings and then to reconstruct the motion sequences and textual descriptions, respectively, from those embeddings. MTDM, in turn, performs an iterative denoising process on the input context to handle the different tasks. Experimental results on benchmark datasets demonstrate superior performance on text-to-motion generation and competitive performance on motion captioning.
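The idea of switching tasks "by simply modifying the input context" can be illustrated with a small sketch. It is not the authors' implementation. It assumes that a modality supplied as a condition enters clean and is kept fixed, while a modality to be generated starts from Gaussian noise. The helper `build_context`, the constant `LATENT_DIM`, and the task names are hypothetical and exist only for illustration.

```python
import random

LATENT_DIM = 4  # hypothetical latent size, for illustration only


def gaussian_noise(dim, rng):
    """Draw a latent vector of i.i.d. standard normal samples."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def build_context(task, motion_latent=None, text_latent=None, seed=0):
    """Assemble the input context for one task (hypothetical helper).

    Assumption: a modality given as a condition enters clean and stays
    fixed, while a modality to be generated starts from Gaussian noise.
    """
    rng = random.Random(seed)
    if task == "text-to-motion":
        if text_latent is None:
            raise ValueError("text-to-motion needs a text latent")
        motion = gaussian_noise(LATENT_DIM, rng)
        text = list(text_latent)
        denoise = ("motion",)
    elif task == "motion-to-text":
        if motion_latent is None:
            raise ValueError("motion-to-text needs a motion latent")
        motion = list(motion_latent)
        text = gaussian_noise(LATENT_DIM, rng)
        denoise = ("text",)
    elif task == "joint":
        # Both modalities start from noise and are denoised together.
        motion = gaussian_noise(LATENT_DIM, rng)
        text = gaussian_noise(LATENT_DIM, rng)
        denoise = ("motion", "text")
    else:
        raise ValueError(f"unknown task: {task}")
    return {"motion": motion, "text": text, "denoise": denoise}


# Usage: the list below is a stand-in for an embedding produced by TED.
text_latent = [0.1, -0.2, 0.3, 0.0]
ctx = build_context("text-to-motion", text_latent=text_latent)
print(ctx["denoise"])
```

In this sketch, the `denoise` field records which latents the diffusion model would update, so a single denoiser can serve all three tasks without architectural changes.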

Paper Structure

This paper contains 13 sections, 7 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustration of diverse tasks: (a) The first row presents the unconditional and joint generation results. Motion, text, and text-motion pairs are generated from Gaussian noise. (b) The second and third rows show conditional generation. Given text or motion as input, MoTe can perform tasks such as text-to-motion and motion-to-text, as well as variation.
  • Figure 2: The overview of our proposed MoTe. (1) Motion encoder-decoder (MED) and text encoder-decoder (TED) compress the motion sequences and language descriptions into two latent representations, which are reconstructed by the corresponding decoders. We adopt Motion Transformer in MED and CLIP-GPT2 in TED. (2) Motion-text diffusion model (MTDM) maps Gaussian noise through stacked dual path diffusion (DPD) blocks, where each DPD block consists of two unimodal transformer blocks and an interaction module.
  • Figure 3: Three variants of the interaction module: 1) In-Context Interaction: all embeddings are concatenated and then processed by a vanilla transformer block. 2) Cross-Attention Interaction: a multi-head cross-attention layer is inserted for modality interaction. 3) AdaLN Interaction: features are modulated by multiple scale-shift operations, whose weights are generated from the sum of the timestep embedding and the modality feature.
  • Figure 4: (a) Statistics of our user study for evaluating the text-to-motion and motion-to-text tasks. (b) Qualitative comparison on the HumanML3D dataset. We compare our proposed method MoTe with MLD [chen2023executing] and MotionGPT [jiang2023motiongpt] on the text-to-motion task, and with MotionGPT on the motion-to-text task.
  • Figure 5: (a) Comparison of different interaction modules in terms of FID (lower is better) and BLEU@4 (higher is better) on the HumanML3D dataset. (b) Failure cases: 1. precise control; 2. out-of-domain description; 3. word repetition.
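
The AdaLN Interaction variant described in the Figure 3 caption can be sketched as a single scale-shift modulation. This is a minimal illustration, not the authors' code. It assumes the `(1 + scale)` convention common in adaptive layer norm implementations, one modulation step rather than the "multiple" steps the caption mentions, and a single linear map producing the scale and shift. The names `adaln_modulate`, `weights`, and `bias` are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, x):
    """Plain matrix-vector product plus bias; weights has one row per output."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def adaln_modulate(tokens, timestep_emb, other_modality_feat, weights, bias):
    """Modulate each token of one modality with a scale and shift.

    The conditioning vector is the sum of the timestep embedding and a
    feature of the other modality, as stated in the caption. The linear
    map (hypothetical parameters) outputs 2*d values: d scales, d shifts.
    """
    d = len(timestep_emb)
    cond = [t + m for t, m in zip(timestep_emb, other_modality_feat)]
    params = linear(weights, bias, cond)
    scale, shift = params[:d], params[d:]
    return [
        [n * (1.0 + s) + b for n, s, b in zip(layer_norm(tok), scale, shift)]
        for tok in tokens
    ]


# Usage with toy values: zero weights and bias reduce AdaLN to plain LayerNorm.
d = 3
motion_tokens = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
t_emb = [0.5, 0.5, 0.5]
text_feat = [0.1, 0.2, 0.3]
zero_w = [[0.0] * d for _ in range(2 * d)]
zero_b = [0.0] * (2 * d)
out = adaln_modulate(motion_tokens, t_emb, text_feat, zero_w, zero_b)
print(out[0])
```

Compared with concatenation or cross-attention, this form of interaction passes the other modality only through a pooled conditioning vector, which is one plausible reason the variants could behave differently on the two tasks.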