
M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models

Seunggeun Chi, Hyung-gun Chi, Hengbo Ma, Nakul Agarwal, Faizan Siddiqui, Karthik Ramani, Kwonjoon Lee

TL;DR

The paper addresses the challenge of generating long, coherent multi-motion sequences from textual descriptions. It introduces M2D2M, which combines a Motion VQ-VAE to discretize motions into tokens, a discrete diffusion model with a dynamic transition probability that depends on token proximity, and a Two-Phase Sampling strategy to produce smooth, contextually faithful multi-motion sequences from text prompts. A Denoising Transformer, conditioned with CLIP-based action descriptions and augmented by relative positional encoding and classifier-free guidance, enables joint and independent denoising across actions, while a new Jerk metric quantifies transition smoothness. Experiments on HumanML3D and KIT-ML demonstrate state-of-the-art performance for both single-motion and multi-motion generation, highlighting improved realism (FID), fidelity to descriptions (R-Top3, MM-Dist), and smoother transitions, with practical inference characteristics enabled by parallelizable steps. Overall, M2D2M advances text-to-human-motion generation by enabling long, coherent action sequences without requiring multi-motion-specific training, with implications for animation, VR/AR, and human-centric AI interactions, while noting potential privacy and policy considerations.
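The summary does not spell out how the Jerk metric is computed. As a rough illustration only, jerk is the third time-derivative of position, so one plausible way to measure transition smoothness is a third-order finite difference over joint positions. The function name `mean_jerk`, the `(T, J, 3)` array layout, and the 20 fps default are assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np

def mean_jerk(positions, fps=20.0):
    """Per-frame jerk magnitude averaged over joints.

    positions: array of shape (T, J, 3) holding T frames of J joint positions.
               This layout is an assumption of the sketch.
    fps:       frame rate used to convert frame differences to time derivatives
               (hypothetical default).
    Returns an array of shape (T - 3,).
    """
    dt = 1.0 / fps
    # Third-order finite difference along time approximates d^3 x / dt^3.
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    # Euclidean norm per joint, then mean over joints.
    return np.linalg.norm(jerk, axis=-1).mean(axis=-1)

# Toy check: x(t) = t^3 has constant jerk, x(t) = t^2 has zero jerk.
t = np.arange(16) / 20.0
cubic = np.zeros((16, 1, 3))
cubic[:, 0, 0] = t**3
quadratic = np.zeros((16, 1, 3))
quadratic[:, 0, 0] = t**2
jerk_cubic = mean_jerk(cubic)
jerk_quadratic = mean_jerk(quadratic)
```

Under this reading, a spike in the returned curve around the boundary between two actions would indicate an abrupt transition, while a flat curve would indicate a smooth one.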

Abstract

We introduce the Multi-Motion Discrete Diffusion Models (M2D2M), a novel approach for human motion generation from textual descriptions of multiple actions, utilizing the strengths of discrete diffusion models. This approach adeptly addresses the challenge of generating multi-motion sequences, ensuring seamless transitions of motions and coherence across a series of actions. The strength of M2D2M lies in its dynamic transition probability within the discrete diffusion model, which adapts transition probabilities based on the proximity between motion tokens, encouraging mixing between different modes. Complemented by a two-phase sampling strategy that includes independent and joint denoising steps, M2D2M effectively generates long-term, smooth, and contextually coherent human motion sequences, utilizing a model trained for single-motion generation. Extensive experiments demonstrate that M2D2M surpasses current state-of-the-art benchmarks for motion generation from text descriptions, showcasing its efficacy in interpreting language semantics and generating dynamic, realistic motions.
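The abstract says only that transition probabilities adapt to the proximity between motion tokens. The sketch below shows one way such a proximity-aware transition matrix could be built for a single forward diffusion step: off-diagonal mass is spread by a softmax over negative codebook distances, so nearby tokens are more likely replacements than distant ones. The function name, the `stay_prob` and `temperature` parameters, and the softmax form are all assumptions for illustration. The sketch also omits the <MASK> token and any dependence on the diffusion step, so it should not be read as the paper's formulation.

```python
import numpy as np

def proximity_transition_matrix(codebook, stay_prob=0.9, temperature=1.0):
    """Row-stochastic matrix Q with Q[i, j] = P(token i -> token j) in one step.

    codebook:    (K, D) array of token embeddings (e.g. a VQ-VAE codebook), K >= 2.
    stay_prob:   probability that a token is left unchanged (hypothetical parameter).
    temperature: softmax temperature over distances (hypothetical parameter).
    """
    K = codebook.shape[0]
    # Pairwise Euclidean distances between token embeddings, shape (K, K).
    dist = np.linalg.norm(codebook[:, None, :] - codebook[None, :, :], axis=-1)
    logits = -dist / temperature
    # Exclude self-transitions from the softmax; they are handled by stay_prob.
    np.fill_diagonal(logits, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    Q = (1.0 - stay_prob) * weights
    Q[np.diag_indices(K)] = stay_prob
    return Q

# Toy codebook: token 1 is close to token 0, token 2 is far away.
toy_codebook = np.array([[0.0], [1.0], [10.0]])
Q = proximity_transition_matrix(toy_codebook)
```

In this toy setup, token 0 is far more likely to be corrupted into its neighbor, token 1, than into the distant token 2, which is the qualitative behavior the abstract attributes to proximity-based transitions.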

Paper Structure

This paper contains 32 sections, 10 equations, 9 figures, 15 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Qualitative Comparison of Multi-Motion Sequences. In the transitions highlighted by the green boxes, our model shows a consistent and gradual progression of poses compared to others. This indicates that our model not only produces more realistic and smooth motions but also maintains the fidelity of each motion segment, aligning accurately with the corresponding action descriptions on top.
  • Figure 1: Overview of action sentence conditioning of M2D2M. We initially decompose sentences to extract action verbs and subsequently utilize these verbs to construct new sentences. These newly formed sentences then serve as conditions for generating human motion sequences.
  • Figure 2: Overview of M2D2M. We train a (a) VQ-VAE to obtain motion tokens, which is subsequently used to train a (b) Denoising Transformer for the discrete diffusion model. In generating human motion, we follow the (c) standard denoising process for single-motion generation and (d) employ Two-Phase Sampling (TPS) for multi-motion generation. A <MASK> token is denoted as 'M' in the figure.
  • Figure 2: PCA plot representing motion tokens from the codebook of Motion VQ-VAE, visualized in 2D (Left) and 3D (Right) space.
  • Figure 3: Comparison of Multi-Motion Generation Algorithms. Unlike heuristic post-processing methods for combining independent motions, such as Handshake (shafir2023human) and SLERP (TEACH:3DV:2022), TPS is a single-stage algorithm for multi-motion generation that requires neither completed individual motions nor a hyper-parameter for transition length.
  • ...and 4 more figures