EMDM: Efficient Motion Diffusion Model for Fast and High-Quality Motion Generation

Wenyang Zhou, Zhiyang Dou, Zeyu Cao, Zhouyingcheng Liao, Jingbo Wang, Wenjia Wang, Yuan Liu, Taku Komura, Wenping Wang, Lingjie Liu

TL;DR

Efficient Motion Diffusion Model (EMDM) generates high-quality human motion in real time and is significantly more efficient than existing motion diffusion models.

Abstract

We introduce Efficient Motion Diffusion Model (EMDM) for fast and high-quality human motion generation. Current state-of-the-art generative diffusion models have produced impressive results but struggle to achieve fast generation without sacrificing quality. On the one hand, previous works, such as motion latent diffusion, conduct diffusion within a latent space for efficiency, but learning such a latent space can be non-trivial. On the other hand, accelerating generation by naively increasing the sampling step size, e.g., with DDIM, often degrades quality because it fails to approximate the complex denoising distribution. To address these issues, we propose EMDM, which captures the complex distribution across multiple sampling steps of the diffusion model, allowing far fewer sampling steps and significantly faster generation. This is achieved with a conditional denoising diffusion GAN that captures multimodal data distributions at arbitrary (and potentially larger) step sizes conditioned on control signals, enabling few-step motion sampling with high fidelity and diversity. To minimize undesired motion artifacts, geometric losses are imposed during network learning. As a result, EMDM achieves real-time motion generation and is significantly more efficient than existing motion diffusion models while maintaining high-quality motion generation. Our code will be publicly available upon publication.
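The claim that larger sampling steps face a more complex denoising distribution can be illustrated with a toy example. The sketch below is not EMDM's model or code: it assumes hypothetical 1D "data" made of two equally likely points at -1 and +1, a standard variance-preserving forward process, and illustrative values for the cumulative signal levels `abar_s` and `abar_t` (all names here are made up for the illustration). It evaluates the exact denoising distribution $q(x_s \mid x_t)$ for a jump from a noisier step $t$ to a cleaner step $s$ and counts its modes.

```python
import numpy as np

def denoising_density(x_s, x_t, abar_s, abar_t, modes=(-1.0, 1.0)):
    """Exact q(x_s | x_t) when the data is an equal-weight mixture of point masses.

    abar_s and abar_t are cumulative signal levels (abar_s > abar_t), i.e.
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * noise. Toy assumption only.
    """
    a = abar_t / abar_s                          # cumulative alpha of the jump s -> t
    var = (1 - abar_s) * (1 - a) / (1 - abar_t)  # variance of q(x_s | x_t, x_0)

    # q(x_0 | x_t) is proportional to N(x_t; sqrt(abar_t) * x_0, 1 - abar_t)
    weights = np.array([
        np.exp(-(x_t - np.sqrt(abar_t) * x0) ** 2 / (2 * (1 - abar_t)))
        for x0 in modes
    ])
    weights = weights / weights.sum()

    density = np.zeros_like(x_s)
    for w, x0 in zip(weights, modes):
        # Mean of the Gaussian posterior q(x_s | x_t, x_0)
        mean = (np.sqrt(abar_s) * (1 - a) * x0
                + np.sqrt(a) * (1 - abar_s) * x_t) / (1 - abar_t)
        density += w * np.exp(-(x_s - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return density

def count_modes(density):
    """Count strict local maxima of a density sampled on a 1D grid."""
    interior = density[1:-1]
    return int(np.sum((interior > density[:-2]) & (interior > density[2:])))

grid = np.linspace(-3, 3, 2001)
small_step = denoising_density(grid, 0.0, abar_s=0.50, abar_t=0.49)  # tiny jump
large_step = denoising_density(grid, 0.0, abar_s=0.90, abar_t=0.10)  # big jump
print(count_modes(small_step), count_modes(large_step))
```

Under these assumptions the small jump yields a single-peaked, nearly Gaussian density, while the large jump yields two well-separated peaks. A sampler that assumes one Gaussian per step cannot represent the second case, which is the motivation the abstract gives for modeling the denoising distribution with a more expressive generator.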

Paper Structure

This paper contains 33 sections, 13 equations, 12 figures, 10 tables, 2 algorithms.

Figures (12)

  • Figure 1: EMDM produces high-quality human motion aligned with input conditions in a short running time. The average running time of EMDM in (a) action-to-motion and (b) text-to-motion tasks is $0.02$s and $0.05$s per sequence, respectively. For reference, the corresponding times for MDM [mdm2022human] are $2.5$s and $12.3$s. The character's color deepens as the sequence progresses in time. (c) Overall comparison of inference time costs on the HumanML3D, KIT, and HumanAct12 datasets. For ease of illustration, the running time is plotted on a log scale. The plot shows running time per frame versus FID for state-of-the-art methods.
  • Figure 2: Pipeline of EMDM. We develop a conditional denoising diffusion GAN to capture the complex denoising distribution of human body motion, allowing a larger sampling step size (Sec. \ref{sec:emdm}). During inference, we use a larger sampling step to quickly sample high-quality motion with respect to the input condition. The detailed sampling algorithm is given in Alg. \ref{alg:sample_from_model}. Note that we omit the time step $t$ for simplicity.
  • Figure 3: The denoising distribution becomes complex (non-Gaussian) as the sampling step size increases for few-step sampling.
  • Figure 4: Qualitative comparison on the text-to-motion task. We visualize the generated motions and real references for six text prompts. EMDM achieves the fastest motion generation while delivering high-quality movements that align with the text input.
  • Figure 5: Qualitative comparisons on the action-to-motion task.
  • ...and 7 more figures
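
As a rough reading of the Figure 1 numbers, the per-sequence times quoted in the caption imply the following speedups over MDM. These ratios are computed here from the rounded values in the caption and are not figures reported in the source:

$$
\frac{2.5\,\text{s}}{0.02\,\text{s}} = 125 \quad \text{(action-to-motion)}, \qquad \frac{12.3\,\text{s}}{0.05\,\text{s}} = 246 \quad \text{(text-to-motion)}
$$

That is, roughly two orders of magnitude on both tasks, assuming the two methods were timed under comparable conditions.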