PhysDiff: Physics-Guided Human Motion Diffusion Model
Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, Jan Kautz
TL;DR
PhysDiff is a physics-guided diffusion framework that inserts a physics-based motion projection into the diffusion sampling loop to enforce physical plausibility in human motion generation. The projection is implemented by a motion imitation policy, trained in a physics simulator, that imitates the denoised motion at a diffusion step and so yields physically consistent motion without retraining the denoiser. Experiments on text-to-motion and action-to-motion tasks with large-scale human motion datasets show state-of-the-art motion quality and substantial reductions in physical artifacts such as floating, foot sliding, and ground penetration. They also give insight into projection schedules and show clear advantages over post-processing. Inference is slower because of the physics simulation, but the approach offers a practical, plug-and-play path to physically plausible motion generation.
Abstract
Denoising diffusion models hold great promise for generating diverse and realistic human motions. However, existing motion diffusion models largely disregard the laws of physics in the diffusion process and often generate physically implausible motions with pronounced artifacts such as floating, foot sliding, and ground penetration. This seriously impacts the quality of generated motions and limits their real-world application. To address this issue, we present a novel physics-guided motion diffusion model (PhysDiff), which incorporates physical constraints into the diffusion process. Specifically, we propose a physics-based motion projection module that uses motion imitation in a physics simulator to project the denoised motion of a diffusion step to a physically plausible motion. The projected motion is then used in the next diffusion step to guide the denoising diffusion process. Intuitively, the use of physics in our model iteratively pulls the motion toward a physically plausible space, which cannot be achieved by simple post-processing. Experiments on large-scale human motion datasets show that our approach achieves state-of-the-art motion quality and improves physical plausibility drastically (>78% for all datasets).
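The sampling idea in the abstract (denoise, project the denoised motion, feed the projection into the next diffusion step) can be illustrated with a toy sketch. This is not the authors' implementation. Everything named here is a hypothetical stand-in: `toy_denoiser` replaces the trained motion denoiser, `toy_physics_projection` replaces the simulator-based motion imitation with a simple clamp that removes ground penetration, and the DDIM-style deterministic update and linear noise schedule are assumptions chosen for brevity. The `project_steps` argument is an assumed way to express a projection schedule.

```python
import numpy as np


def toy_denoiser(x_t, t):
    """Hypothetical stand-in for a trained denoiser: returns a guess of the clean motion."""
    return 0.9 * x_t


def toy_physics_projection(x0):
    """Hypothetical stand-in for physics-based projection.

    The paper uses motion imitation in a physics simulator; here we only
    clamp the height channel (last axis, index 2) to be non-negative.
    """
    projected = x0.copy()
    projected[..., 2] = np.maximum(projected[..., 2], 0.0)
    return projected


def sample_with_projection(shape, num_steps=50, project_steps=None, seed=0):
    """DDIM-style sampling loop (assumed) with projection applied at chosen steps."""
    rng = np.random.default_rng(seed)
    # Assumed schedule: alpha_bar[0] = 1 (clean) down to ~0 (pure noise).
    alpha_bar = np.linspace(1.0, 0.001, num_steps + 1)
    if project_steps is None:
        project_steps = set(range(1, num_steps + 1))  # project at every step

    x_t = rng.standard_normal(shape)  # start from Gaussian noise
    for t in range(num_steps, 0, -1):
        x0_hat = toy_denoiser(x_t, t)
        if t in project_steps:
            # Replace the denoised motion with its projected version
            # before it is used to form the next diffusion state.
            x0_hat = toy_physics_projection(x0_hat)
        # Noise estimate implied by the (possibly projected) clean motion.
        eps_hat = (x_t - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1.0 - alpha_bar[t])
        # Deterministic step to the previous noise level.
        x_t = np.sqrt(alpha_bar[t - 1]) * x0_hat + np.sqrt(1.0 - alpha_bar[t - 1]) * eps_hat
    return x_t


# Toy motion: 8 frames, 4 joints, xyz coordinates.
motion = sample_with_projection((8, 4, 3))
unprojected = sample_with_projection((8, 4, 3), project_steps=set())
```

With projection at every step, the final toy motion has no negative heights, while the unprojected run does. Passing a smaller `project_steps` set applies the projection at only some diffusion steps, which reduces the number of (in the real method, expensive) projection calls.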
