DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion

Yunhong Lou, Linchao Zhu, Yaxiong Wang, Xiaohan Wang, Yi Yang

TL;DR

This work tackles the lack of diversity in text-to-motion generation by expanding the motion-caption space with a large Wild Motion-Caption (WMC) dataset and by enriching textual conditioning through Hierarchical Semantic Aggregation (HSA). It introduces Motion Discrete Diffusion (MDD), a diffusion-based framework that operates in a discretized motion latent space learned by a Motion VQ-VAE, enabling high-fidelity motions with enhanced diversity. Empirical results on HumanML3D and KIT-ML demonstrate state-of-the-art motion quality and competitive diversity, with ablations validating the contributions of WMC, HSA, and hybrid guidance. The approach provides a practical path toward more varied, semantically faithful human motions driven by natural language prompts; the authors state that the dataset, code, and pretrained models will be released for reproducibility.
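To make the "discretized motion latent space" concrete, here is a minimal sketch of the nearest-neighbour codebook lookup that VQ-VAE-style models generally use to turn continuous latents into tokens. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy 2-D codebook, and the latent values are all hypothetical, and a real Motion VQ-VAE would learn the encoder, decoder, and codebook jointly.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: List[Sequence[float]],
             codebook: List[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)),
            key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(tokens: List[int],
               codebook: List[Sequence[float]]) -> List[List[float]]:
    """Look tokens back up in the codebook (the input a decoder would see)."""
    return [list(codebook[k]) for k in tokens]


# Toy data (hypothetical): three codebook entries, three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

tokens = quantize(latents, codebook)
recovered = dequantize(tokens, codebook)
print(tokens)
```

In this sketch each motion segment becomes a single integer index, which is what lets the diffusion process work over discrete tokens instead of raw pose values.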

Abstract

We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions while preserving motion diversity. Despite the recent significant progress in text-based human motion generation, existing methods often prioritize fitting training motions at the expense of action diversity. Consequently, striking a balance between motion quality and diversity remains an unresolved challenge. This problem is compounded by two key factors: 1) the lack of diversity in motion-caption pairs in existing benchmarks and 2) the unilateral and biased semantic understanding of the text prompt, which focuses primarily on the verb component while neglecting the nuanced distinctions indicated by other words. In response to the first issue, we construct a large-scale Wild Motion-Caption dataset (WMC) to extend the restricted action boundary of existing well-annotated datasets, enabling the learning of diverse motions through a more extensive range of actions. To this end, we train a motion BLIP on top of a pretrained vision-language model and use it to automatically generate diverse motion captions for the collected motion sequences. As a result, we build a dataset comprising 8,888 motions coupled with 141k text descriptions. To comprehensively understand the text command, we propose a Hierarchical Semantic Aggregation (HSA) module to capture the fine-grained semantics. Finally, we integrate the above two designs into an effective Motion Discrete Diffusion (MDD) framework to strike a balance between motion quality and diversity. Extensive experiments on HumanML3D and KIT-ML show that our DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity. Dataset, code, and pretrained models will be released to reproduce all of our results.

Paper Structure

This paper contains 17 sections, 12 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Our MDD can generate precise and high-quality motions with the aid of our WMC dataset and HSA text encoder. Darker colors indicate later time.
  • Figure 2: Overview of the method. MDD has two training stages: a) trains an encoder $\mathcal{E}$, a decoder $\mathcal{D}$, and a codebook $C$ by reconstructing motions; b) trains a motion denoiser $p_\theta(\bm{u}_{t-1}|\bm{u}_t, c)$ to reverse a Markov chain conditioned on text $w$. At inference, the motion denoiser generates motion tokens $u_0$ from fully masked tokens $u_T$, and the decoder $\mathcal{D}$ then decodes $u_0$ into natural human motion.
  • Figure 3: Visual comparison with state-of-the-art methods on text-to-motion generation. Colors from light to dark indicate the progression of time. MDD is shown in blue, the prior SOTA in red, and the retrieval results from HumanML3D in yellow. The left panel shows that our approach produces more plausible results for a given prompt; the right panel shows new motion patterns or combinations learned by our approach.
  • Figure 4: Ablation experiments of the scale $s$.
  • Figure 5: Visualization of motion and diverse captions obtained from the WMC dataset.
  • ...and 2 more figures
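
The inference procedure described in the Figure 2 caption (start from fully masked tokens $u_T$, denoise step by step to $u_0$, then decode) can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' code: `toy_denoiser` is a hypothetical random stand-in for the learned $p_\theta(\bm{u}_{t-1}|\bm{u}_t, c)$, the `MASK` id is invented, and the rule for how many positions are revealed per step is chosen here for simplicity because the summary does not specify the paper's schedule.

```python
import random
from typing import List

MASK = -1  # hypothetical id for the [MASK] token


def toy_denoiser(tokens: List[int], text: str,
                 rng: random.Random, codebook_size: int) -> List[int]:
    """Stand-in for the learned denoiser: proposes a token per position.

    A real model would condition on `tokens` and `text`; this toy ignores
    both and samples uniformly from the codebook indices.
    """
    return [rng.randrange(codebook_size) for _ in tokens]


def sample_tokens(length: int, steps: int, text: str,
                  codebook_size: int, seed: int = 0) -> List[int]:
    """Reverse process: go from fully masked u_T to fully revealed u_0."""
    rng = random.Random(seed)
    tokens = [MASK] * length  # u_T: every position is masked
    for t in range(steps, 0, -1):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break  # everything is already revealed
        proposal = toy_denoiser(tokens, text, rng, codebook_size)
        # Assumed schedule: reveal ceil(len(masked) / t) positions, which
        # guarantees that no mask is left once t reaches 1.
        n_reveal = -(-len(masked) // t)
        for i in rng.sample(masked, n_reveal):
            tokens[i] = proposal[i]
    return tokens


u0 = sample_tokens(length=8, steps=4, text="a person walks forward",
                   codebook_size=16)
print(u0)
```

In the full pipeline the resulting `u0` would be looked up in the codebook $C$ and passed through the decoder $\mathcal{D}$ to obtain a motion sequence; that step is omitted here because it depends on the trained VQ-VAE.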