Rethinking Diffusion for Text-Driven Human Motion Generation: Redundant Representations, Evaluation, and Masked Autoregression

Zichong Meng, Yiming Xie, Xiaogang Peng, Zeyu Han, Huaizu Jiang

TL;DR

The paper investigates why VQ-based discrete motion methods outperform diffusion-based approaches and identifies motion representation redundancy and evaluation biases as key factors. It proposes a diffusion-based framework that reforms motion representations to essential, 3D continuous features, projects them into a fine-grained latent space via a 1D ResNet AutoEncoder, and employs masked autoregressive diffusion with an autoregressive generation branch for enhanced text-to-motion generation. Robust evaluators focused on essential motion dimensions are introduced to enable fair comparisons. Experiments on KIT-ML and HumanML3D demonstrate state-of-the-art results and validate the contributions of representation reform and autoregressive diffusion, highlighting practical gains in realism, diversity, and alignment with textual prompts.

Abstract

Since 2023, Vector Quantization (VQ)-based discrete generation methods have rapidly dominated human motion generation, primarily surpassing diffusion-based continuous generation methods in standard performance metrics. However, VQ-based methods have inherent limitations. Representing continuous motion data as limited discrete tokens leads to inevitable information loss, reduces the diversity of generated motions, and restricts their ability to function effectively as motion priors or generation guidance. In contrast, because diffusion-based methods generate in a continuous space, they are well-suited to address these limitations and also show potential for model scalability. In this work, we systematically investigate why current VQ-based methods perform well and explore the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Drawing on these insights, we preserve the inherent strengths of a diffusion-based human motion generation model and gradually optimize it with inspiration from VQ-based approaches. Our approach introduces a human motion diffusion model capable of masked autoregression, optimized with a reformed data representation and distribution. Additionally, we propose a more robust evaluation method to assess different approaches. Extensive experiments on various datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art performance.

Paper Structure

This paper contains 30 sections, 24 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: FID results on the HumanML3D dataset. The bubble size is proportional to the model size. We achieve superior performance and demonstrate model scalability.
  • Figure 2: Code usage of VQ-VAEs trained with redundant features is more balanced than that of VQ-VAEs trained with only essential features.
  • Figure 3: Method Overview. (a) The reformed motion sequence is projected into a compact fine-grained latent space through a Motion AutoEncoder. (b) The motion latents $\mathbf{x}^{0:3}$ are processed through a Masked Autoregressive Transformer, where they are either randomly masked (in training) or appended (in inference) with a learnable mask vector (yellow-colored latents). The transformer provides a condition z for the masked positions to the Diffusion MLPs, which produce the clean latents $\mathbf{x}^{3:4}$ from the noised input. (c) A visual illustration of motion masked autoregression, where masked latents (yellow-colored) can be reordered into a pseudo-position, allowing $p(\text{m}'^{3:4}|\mathbf{x}'^{0:2})$ prediction.
  • Figure 4: Visualization Comparison between our method and baseline state-of-the-art methods. Our method generates motion that is more realistic and more accurately follows the fine details of the textual condition.
  • Figure A1: Our method's temporal editing process, including prefix, in-between, and suffix editing. The latents to be edited (red) are treated as masked latents (yellow). The sequence is then input into the generation branch in Figure 3 to generate edited latents conditioned on the editing textual instruction and the non-edited latents (blue).
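
The masking idea described in the Figure 3 and Figure A1 captions can be illustrated with a small sketch. It is not the authors' implementation: the function names `edit_mask` and `reorder_to_pseudo_positions`, the `span` parameter, and the reading that "prefix", "in-between", and "suffix" name the region being edited are all assumptions made for illustration. The sketch only builds the boolean mask over latent positions and moves masked positions to the end of the sequence; the transformer and the Diffusion MLPs are not modeled.

```python
from typing import List


def edit_mask(num_latents: int, mode: str, span: int) -> List[bool]:
    """Return True at latent positions that should be regenerated.

    Assumption: the mode names the region being edited, so "prefix"
    masks the first `span` latents, "suffix" the last `span`, and
    "in-between" a centered window of `span` latents.
    """
    if not 0 < span < num_latents:
        raise ValueError("span must be between 1 and num_latents - 1")
    if mode == "prefix":
        start = 0
    elif mode == "suffix":
        start = num_latents - span
    elif mode == "in-between":
        start = (num_latents - span) // 2
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [start <= i < start + span for i in range(num_latents)]


def reorder_to_pseudo_positions(mask: List[bool]) -> List[int]:
    """Return an index order with kept latents first and masked latents last.

    This loosely mirrors panel (c) of Figure 3, where masked latents are
    moved to a pseudo-position so they are predicted from the kept ones.
    """
    kept = [i for i, masked in enumerate(mask) if not masked]
    masked_idx = [i for i, masked in enumerate(mask) if masked]
    return kept + masked_idx


if __name__ == "__main__":
    mask = edit_mask(num_latents=8, mode="in-between", span=2)
    print(mask)
    print(reorder_to_pseudo_positions(mask))
```

In this sketch the positions marked `True` play the role of the yellow (masked) latents and the remaining positions play the role of the blue (non-edited) latents that condition the generation.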