AMD: Autoregressive Motion Diffusion

Bo Han, Hao Peng, Minjing Dong, Yi Ren, Yixuan Shen, Chang Xu

TL;DR

AMD integrates the text prompt at the current timestep with the text prompt and action sequences at the previous timestep as conditional information to predict the current action sequences in an iterative manner, enabling for the first time the generation of high-definition and high-fidelity human motions based on user-defined modality input.

Abstract

Human motion generation aims to produce plausible human motion sequences according to various conditional inputs, such as text or audio. Although existing methods can generate motion from short prompts and simple motion patterns, they struggle with long prompts or complex motions. The challenges are two-fold: 1) the scarcity of human motion-capture data for long prompts and complex motions, and 2) the high diversity of human motions in the temporal domain and the substantial divergence of distributions from conditional modalities, leading to a many-to-many mapping problem when generating motion with complex and long texts. In this work, we address these gaps by 1) constructing the first dataset pairing long textual descriptions and 3D complex motions (HumanLong3D), and 2) proposing an autoregressive motion diffusion model (AMD). Specifically, AMD integrates the text prompt at the current timestep with the text prompt and action sequences at the previous timestep as conditional information to predict the current action sequences in an iterative manner. Furthermore, we present its generalization for X-to-Motion with "No Modality Left Behind", enabling the generation of high-definition and high-fidelity human motions based on user-defined modality input.
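
The iterative conditioning described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `generate_long_motion`, `sample_segment`, and `toy_sampler` are hypothetical names, the motion layout (a list of frames, each a list of floats) is an assumption, and the toy sampler merely stands in for a trained diffusion model.

```python
from typing import Callable, List, Optional, Sequence

# Assumed layout: a motion is a list of frames, each frame a list of pose features.
Motion = List[List[float]]
Sampler = Callable[[str, Optional[str], Optional[Motion]], Motion]


def generate_long_motion(prompts: Sequence[str], sample_segment: Sampler) -> List[Motion]:
    """Generate one motion segment per prompt, conditioning each segment on
    the current prompt plus the previous prompt and previous motion."""
    segments: List[Motion] = []
    prev_prompt: Optional[str] = None
    prev_motion: Optional[Motion] = None
    for prompt in prompts:
        # The first segment has no history, so both previous inputs are None.
        motion = sample_segment(prompt, prev_prompt, prev_motion)
        segments.append(motion)
        # The current prompt and motion become the context for the next step.
        prev_prompt, prev_motion = prompt, motion
    return segments


def toy_sampler(prompt: str, prev_prompt: Optional[str],
                prev_motion: Optional[Motion]) -> Motion:
    """Placeholder for a trained model: continues from the last frame of the
    previous segment and emits one 1-D frame per word in the prompt."""
    start = prev_motion[-1][0] if prev_motion else 0.0
    return [[start + k + 1.0] for k in range(len(prompt.split()))]


segments = generate_long_motion(["a man walks", "then he turns around"], toy_sampler)
```

Passing the sampler in as a callable keeps the autoregressive loop separate from the model, so the same loop would apply whatever produces each segment.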

Paper Structure (17 sections, 5 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Overview of the Autoregressive Motion Diffusion model. Given the current text prompt $S^i$, the previous text prompt $S^{i-1}$, and the previous motion $X_0^{i-1}$ (green arrow), we first encode the context information (blue block). Then, we feed the input conditions and the corrupted motion $X_T^i$ to the AMD Module (Fig. 2) to generate the clean motion $X_0^i$. Afterward, we pass the current text prompt $S^{i}$ and motion $X_0^i$ to the next time step. Iterating this procedure yields motion sequences for long text prompts.
  • Figure 2: AMD Module. The gray blocks denote the denoising process, while the yellow blocks represent the diffusion process. Within the AMD module, they appear in pairs T times (with the exception of the last one).
  • Figure 3: Results for compound motion synthesis (blue: "there is a man doing left smash right cover."; yellow: "then he steps forward and turn around"). The red line marks a discrepancy between the generated motion and the ground truth.
  • Figure 4: Visualization on HumanML3D Dataset.
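
The pairing of denoising and diffusion steps described for Figure 2 can be sketched as a sampling loop that predicts a clean motion, re-noises it to the next lower noise level, and skips the re-noising on the final step. This is a minimal illustration under stated assumptions, not the paper's code: the linear beta schedule, the flat list-of-floats motion vector, and the names `make_alpha_bars`, `q_sample`, `sample_motion`, and `predict_x0` are all hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence


def make_alpha_bars(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / max(T - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(x0: Sequence[float], alpha_bar: float, rng: random.Random) -> List[float]:
    """Diffusion step: corrupt a clean vector to the noise level given by alpha_bar."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * v + s * rng.gauss(0.0, 1.0) for v in x0]


def sample_motion(predict_x0: Callable[[List[float], int, object], List[float]],
                  cond: object, dim: int, T: int, rng: random.Random) -> List[float]:
    """Run T denoising steps, each followed by a diffusion step except the last."""
    if T < 1:
        raise ValueError("T must be at least 1")
    alpha_bars = make_alpha_bars(T)
    x_t = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise (X_T)
    for t in range(T - 1, -1, -1):
        x0_hat = predict_x0(x_t, t, cond)  # denoising: predict the clean motion
        if t == 0:
            return x0_hat  # the final step has no paired diffusion step
        x_t = q_sample(x0_hat, alpha_bars[t - 1], rng)  # diffuse back to level t-1
    raise AssertionError("unreachable")
```

In use, `predict_x0` would be a trained network taking the noisy motion, the step index, and the encoded conditions; any callable with that signature works for experimentation.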