M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models
Seunggeun Chi, Hyung-gun Chi, Hengbo Ma, Nakul Agarwal, Faizan Siddiqui, Karthik Ramani, Kwonjoon Lee
TL;DR
The paper addresses the challenge of generating long, coherent multi-motion sequences from textual descriptions. It introduces M2D2M, which combines three components: a Motion VQ-VAE that discretizes motions into tokens, a discrete diffusion model whose transition probability varies dynamically with token proximity, and a Two-Phase Sampling strategy that produces smooth, contextually faithful multi-motion sequences from text prompts. A Denoising Transformer, conditioned on CLIP-based encodings of the action descriptions and augmented with relative positional encoding and classifier-free guidance, supports both joint and independent denoising across actions, and a new Jerk metric quantifies transition smoothness. Experiments on HumanML3D and KIT-ML show state-of-the-art performance for both single-motion and multi-motion generation, with improved realism (FID), fidelity to descriptions (R-Top3, MM-Dist), and smoother transitions, and with inference kept practical by parallelizable steps. Overall, M2D2M advances text-to-human-motion generation by producing long, coherent action sequences without multi-motion-specific training, with implications for animation, VR/AR, and human-centric AI interactions. The paper also notes potential privacy and policy considerations.
Abstract
We introduce Multi-Motion Discrete Diffusion Models (M2D2M), a novel approach to generating human motion from textual descriptions of multiple actions that builds on the strengths of discrete diffusion models. The approach addresses the challenge of generating multi-motion sequences, ensuring seamless transitions between motions and coherence across a series of actions. The strength of M2D2M lies in the dynamic transition probability of its discrete diffusion model, which adapts transition probabilities to the proximity between motion tokens and thereby encourages mixing between different modes. Complemented by a two-phase sampling strategy that includes independent and joint denoising steps, M2D2M generates long-term, smooth, and contextually coherent human motion sequences using a model trained only for single-motion generation. Extensive experiments demonstrate that M2D2M surpasses current state-of-the-art methods for motion generation from text descriptions, showing its efficacy in interpreting language semantics and generating dynamic, realistic motions.
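To make the idea of a proximity-dependent transition probability concrete, the following is a minimal sketch and not the authors' formulation. It assumes a toy codebook of 2-D token embeddings, a fixed probability of keeping a token, and an exponential distance kernel with a temperature. The names `proximity_transition_matrix`, `corrupt`, `stay_prob`, and `temperature` are hypothetical and introduced only for illustration.

```python
import math
import random


def proximity_transition_matrix(codebook, stay_prob, temperature):
    """Build a row-stochastic transition matrix over token ids.

    Assumption (not from the paper): token i is kept with probability
    stay_prob; otherwise it moves to token j != i with probability
    proportional to exp(-distance(i, j) / temperature).
    """
    size = len(codebook)
    matrix = []
    for i in range(size):
        weights = [
            0.0 if j == i
            else math.exp(-math.dist(codebook[i], codebook[j]) / temperature)
            for j in range(size)
        ]
        total = sum(weights)
        row = [(1.0 - stay_prob) * w / total for w in weights]
        row[i] = stay_prob
        matrix.append(row)
    return matrix


def corrupt(tokens, matrix, rng):
    """Apply one forward noising step by sampling each token's successor."""
    ids = list(range(len(matrix)))
    return [rng.choices(ids, weights=matrix[t])[0] for t in tokens]


# Toy codebook: tokens 0 and 1 are close together, token 2 is far from both.
codebook = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
matrix = proximity_transition_matrix(codebook, stay_prob=0.8, temperature=1.0)
noisy = corrupt([0, 0, 1, 2], matrix, random.Random(0))
```

In this sketch a token is far more likely to be replaced by a nearby token than by a distant one. Raising `temperature` flattens the kernel toward uniform replacement, which is one way to picture transition probabilities that change over the course of diffusion.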
