PhysDiff: Physics-Guided Human Motion Diffusion Model

Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, Jan Kautz

TL;DR

PhysDiff is a physics-guided diffusion framework for human motion generation that inserts a physics-based motion projection into the diffusion sampling loop to enforce physical plausibility. The projection uses a motion imitation policy, trained in a physics simulator, to track the denoised motion of a diffusion step, which yields physically plausible motion without retraining the denoiser. On text-to-motion and action-to-motion tasks over large-scale datasets, the approach reports state-of-the-art motion quality and improves physical plausibility by more than 78% on all datasets. The paper also studies how the projection steps are scheduled and shows that iterative projection has clear advantages over physics-based post-processing. Inference is slower because of the physics simulation, but the projection can be plugged into existing motion denoisers.

Abstract

Denoising diffusion models hold great promise for generating diverse and realistic human motions. However, existing motion diffusion models largely disregard the laws of physics in the diffusion process and often generate physically-implausible motions with pronounced artifacts such as floating, foot sliding, and ground penetration. This seriously impacts the quality of generated motions and limits their real-world application. To address this issue, we present a novel physics-guided motion diffusion model (PhysDiff), which incorporates physical constraints into the diffusion process. Specifically, we propose a physics-based motion projection module that uses motion imitation in a physics simulator to project the denoised motion of a diffusion step to a physically-plausible motion. The projected motion is further used in the next diffusion step to guide the denoising diffusion process. Intuitively, the use of physics in our model iteratively pulls the motion toward a physically-plausible space, which cannot be achieved by simple post-processing. Experiments on large-scale human motion datasets show that our approach achieves state-of-the-art motion quality and improves physical plausibility drastically (>78% for all datasets).
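
The loop described in the abstract (denoise, project the denoised motion, then use the projected motion for the next diffusion step) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the authors' implementation: the "motion" is a single foot-height curve, `gaussian_denoiser` is an analytic stand-in for a learned denoiser, `project_to_ground` is a hypothetical clamp that replaces the paper's simulator-based motion imitation, and the DDIM-style deterministic update and linear noise schedule are choices made only to keep the sketch short.

```python
import numpy as np

def gaussian_denoiser(x_t, alpha_bar, mean, std):
    """Toy denoiser (assumption): exact E[x0 | x_t] when x0 ~ N(mean, std^2)."""
    gain = np.sqrt(alpha_bar) * std**2 / (alpha_bar * std**2 + 1.0 - alpha_bar)
    return mean + gain * (x_t - np.sqrt(alpha_bar) * mean)

def project_to_ground(x0_hat):
    """Hypothetical projection: clamp heights to the ground plane (height >= 0).
    The paper instead imitates the motion with a policy in a physics simulator."""
    return np.maximum(x0_hat, 0.0)

def physics_guided_sample(mean, std, num_steps, project_steps, seed=0):
    """Run a DDIM-style sampler, projecting the denoised estimate at chosen steps."""
    rng = np.random.default_rng(seed)
    # Index 0 is the noisiest level, the last entry (exactly 1.0) is clean data.
    alpha_bars = np.linspace(0.01, 1.0, num_steps + 1)
    x = rng.standard_normal(mean.shape)
    n_projections = 0
    for i in range(num_steps):
        ab_t, ab_s = alpha_bars[i], alpha_bars[i + 1]
        x0_hat = gaussian_denoiser(x, ab_t, mean, std)
        if i in project_steps:
            # The projected estimate replaces x0_hat and guides the next step.
            x0_hat = project_to_ground(x0_hat)
            n_projections += 1
        # Deterministic DDIM-style move from noise level t to the cleaner level s.
        eps_hat = (x - np.sqrt(ab_t) * x0_hat) / np.sqrt(1.0 - ab_t)
        x = np.sqrt(ab_s) * x0_hat + np.sqrt(1.0 - ab_s) * eps_hat
    return x, n_projections

# Toy "data": a foot-height curve that dips below the ground plane.
frames = np.linspace(0.0, 2.0 * np.pi, 60)
foot_height_mean = 0.1 * np.sin(frames) - 0.05
late_steps = set(range(40, 50))  # project during the last 10 of 50 steps
motion, n_proj = physics_guided_sample(foot_height_mean, 0.02, 50, late_steps)
```

Because the final step is among the projected ones and the last noise level is exactly clean, the returned curve never goes below the ground in this toy setup. Passing an empty `project_steps` runs the plain sampler for comparison.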

Paper Structure (14 sections, 4 equations, 6 figures, 5 tables, 2 algorithms)

Figures (6)

  • Figure 1: Our PhysDiff model generates physically-plausible motions using a physics-based motion projection in the diffusion process, eliminating artifacts such as floating, ground penetration, and foot sliding, often observed with state-of-the-art models.
  • Figure 2: (Left) Performing one physics-based projection step (post-processing) at the end yields unnatural motion since the motion is too physically-implausible to correct. (Right) Our approach solves this issue by iteratively applying physics and diffusion.
  • Figure 3: Overview of PhysDiff. Each physics-guided diffusion step denoises a motion from timestep $t$ to $s$, where physics-based motion projection is used to enforce physical constraints. The projection is achieved using a motion imitation policy to control a character in a physics simulator. A scheduler controls when the physics-based projection is applied. The denoiser can be any motion-denoising network.
  • Figure 4: Visual comparison of PhysDiff against the SOTA, MDM (Tevet et al., 2022), on HumanML3D, HumanAct12, and UESTC. PhysDiff significantly reduces physical artifacts such as floating and penetration. Please refer to the project page at https://nvlabs.github.io/PhysDiff for more qualitative comparisons.
  • Figure 5: Effect of varying the number of physics-based projection steps for text-to-motion generation on HumanML3D (Guo et al., 2022).
  • ...and 1 more figure
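
The captions for Figures 3 and 5 mention a scheduler that controls when the physics-based projection is applied and how many projection steps are used. A minimal sketch of such a scheduler follows. The function name `projection_schedule` and the three placement modes (`"end"`, `"start"`, `"uniform"`) are hypothetical and only illustrate the idea; they are not taken from the paper.

```python
def projection_schedule(num_diffusion_steps, num_projections, placement="end"):
    """Return sorted step indices (0 = first, noisiest step) that get a projection.

    All placement modes here are illustrative assumptions.
    """
    if not 0 <= num_projections <= num_diffusion_steps:
        raise ValueError("num_projections must lie in [0, num_diffusion_steps]")
    if num_projections == 0:
        return []
    if placement == "end":
        # Project only during the last, least noisy steps.
        return list(range(num_diffusion_steps - num_projections, num_diffusion_steps))
    if placement == "start":
        # Project only during the first, noisiest steps.
        return list(range(num_projections))
    if placement == "uniform":
        if num_projections == 1:
            return [num_diffusion_steps - 1]
        # Evenly spaced indices from the first to the last step (integer math
        # keeps them distinct because the spacing is at least one step).
        last = num_diffusion_steps - 1
        return [j * last // (num_projections - 1) for j in range(num_projections)]
    raise ValueError(f"unknown placement: {placement!r}")

late = projection_schedule(50, 10)              # steps 40 through 49
spread = projection_schedule(10, 4, "uniform")  # [0, 3, 6, 9]
```

A sampler can then check whether the current step index is in the returned list before applying the projection, which makes it easy to vary the number of projection steps as in the Figure 5 experiment.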