MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks
Yiming Wu, Wei Ji, Kecheng Zheng, Zicheng Wang, Dong Xu
TL;DR
MoTe tackles unified motion-text generation by learning the marginal, conditional, and joint distributions of motion and text in a single diffusion framework. It uses a Motion Encoder-Decoder (MED) and a Text Encoder-Decoder (TED) to embed motions and text into latent spaces, and a Motion-Text Diffusion Model (MTDM) to perform diffusion across those latents, with a configurable interaction module enabling text-to-motion, motion-to-text, or joint generation from a single model. The key contributions include an in-depth analysis of interaction modules, empirical evidence of a trade-off between text-to-motion and motion-to-text performance, and state-of-the-art text-to-motion results on HumanML3D along with competitive motion-to-text performance. The framework provides a versatile baseline for multi-modal motion generation and paves the way for broader multi-modal diffusion systems.
Abstract
Recently, human motion analysis has improved greatly thanks to generative models such as denoising diffusion models and large language models. However, existing approaches mainly focus on generating motions from textual descriptions and overlook the reciprocal task. In this paper, we present MoTe, a unified multi-modal model that can handle diverse tasks by learning the marginal, conditional, and joint distributions of motion and text simultaneously. MoTe handles paired text-motion generation, motion captioning, and text-driven motion generation by simply modifying the input context. Specifically, MoTe is composed of three components: Motion Encoder-Decoder (MED), Text Encoder-Decoder (TED), and Motion-Text Diffusion Model (MTDM). In particular, MED and TED are trained to extract latent embeddings and subsequently reconstruct the motion sequences and textual descriptions from those embeddings, respectively. MTDM, on the other hand, performs an iterative denoising process on the input context to handle diverse tasks. Experimental results on the benchmark datasets demonstrate the superior performance of our proposed method on text-to-motion generation and competitive performance on motion captioning.
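
The idea of selecting a task by modifying the input context can be illustrated with a small sketch. This is not the paper's implementation: the `Context` class, the `build_context` function, the task names, and the convention that a provided latent is kept fixed while the other is initialized from Gaussian noise are all assumptions made for illustration. The latents stand in for embeddings that MED and TED would produce.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

Latent = List[float]


@dataclass
class Context:
    """Hypothetical input context for a joint motion-text denoiser."""
    motion: Latent
    text: Latent
    denoise_motion: bool  # True if the motion latent starts as noise
    denoise_text: bool    # True if the text latent starts as noise


def gaussian_noise(dim: int, rng: random.Random) -> Latent:
    """Sample a standard Gaussian vector of length `dim`."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def build_context(
    task: str,
    dim: int,
    motion_latent: Optional[Latent] = None,
    text_latent: Optional[Latent] = None,
    seed: int = 0,
) -> Context:
    """Assemble the denoiser input for one of three assumed task names.

    Assumption: the conditioning modality is passed in clean and kept fixed,
    and the modality to be generated is initialized from noise.
    """
    rng = random.Random(seed)
    if task == "text_to_motion":
        if text_latent is None:
            raise ValueError("text_to_motion requires a text latent")
        return Context(gaussian_noise(dim, rng), list(text_latent), True, False)
    if task == "motion_to_text":
        if motion_latent is None:
            raise ValueError("motion_to_text requires a motion latent")
        return Context(list(motion_latent), gaussian_noise(dim, rng), False, True)
    if task == "joint":
        # Both modalities are generated, so both start from noise.
        return Context(gaussian_noise(dim, rng), gaussian_noise(dim, rng), True, True)
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    ctx = build_context("text_to_motion", dim=4, text_latent=[0.1, 0.2, 0.3, 0.4])
    print(ctx.denoise_motion, ctx.denoise_text)
```

In this sketch the same function serves all three tasks, and only the flags and the choice of which latent is noise change, which mirrors the abstract's claim that one model covers text-to-motion, motion captioning, and paired generation.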
