
Guided Motion Diffusion for Controllable Human Motion Synthesis

Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, Siyu Tang

TL;DR

This work tackles the challenge of integrating spatial constraints into diffusion-based human motion synthesis by introducing Guided Motion Diffusion (GMD), which couples a scalar goal function $G_x(\cdot)$ with two innovations: Emphasis projection, which improves coherence between the global trajectory and local poses, and Dense signal propagation, which turns sparse spatial cues into dense conditioning. GMD supports three spatially guided tasks (trajectory-conditioned generation, keyframe-conditioned generation, and obstacle avoidance) while retaining strong text-to-motion performance that surpasses state-of-the-art baselines. The approach uses a two-stage pipeline and a learned denoiser to counter guidance bias and propagate constraints through the diffusion steps, producing coherent, controllable motions without retraining for each spatial objective. These capabilities are relevant to animation, gaming, and VR, where controllable motion synthesis in 3D environments is increasingly important.
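To make the idea of a scalar goal function concrete, here is a minimal sketch assuming a keyframe-reaching goal and one plain gradient-descent step on a predicted clean root trajectory. The names `keyframe_goal` and `guided_step`, the 2D ground-plane trajectory, and the step size are hypothetical illustrations, not GMD's actual $G_x(\cdot)$ or sampler; in a diffusion sampler, a step like this would typically be interleaved with the reverse denoising steps.

```python
import numpy as np

def keyframe_goal(traj, targets, key_idx):
    """Hypothetical scalar goal G: sum of squared distances between the root
    trajectory at the keyframes and their target locations.
    Returns the goal value and its analytic gradient w.r.t. `traj`."""
    diff = traj[key_idx] - targets            # residuals at keyframes only
    value = float(np.sum(diff ** 2))
    grad = np.zeros_like(traj)
    grad[key_idx] = 2.0 * diff                # zero at every non-keyframe
    return value, grad

def guided_step(traj, targets, key_idx, scale=0.1):
    """One gradient-descent guidance step on a predicted clean trajectory."""
    _, grad = keyframe_goal(traj, targets, key_idx)
    return traj - scale * grad

rng = np.random.default_rng(0)
traj = rng.normal(size=(60, 2))               # 60 frames of (x, z) root positions
key_idx = np.array([0, 30, 59])
targets = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 0.0]])

new_traj = guided_step(traj, targets, key_idx)
before, _ = keyframe_goal(traj, targets, key_idx)
after, _ = keyframe_goal(new_traj, targets, key_idx)
print(before, after)  # after is approximately 0.64 * before when scale=0.1
```

The gradient is nonzero only at the keyframes, so every other frame is left untouched by this step; that sparsity is exactly what makes a sparse guidance signal easy for the model to ignore.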

Abstract

Denoising diffusion models have shown great promise in human motion synthesis conditioned on natural language descriptions. However, integrating spatial constraints, such as pre-defined motion trajectories and obstacles, remains a challenge despite being essential for bridging the gap between isolated human motion and its surrounding environment. To address this issue, we propose Guided Motion Diffusion (GMD), a method that incorporates spatial constraints into the motion generation process. Specifically, we propose an effective feature projection scheme that manipulates the motion representation to enhance the coherence between spatial information and local poses. Together with a new imputation formulation, this scheme allows the generated motion to reliably conform to spatial constraints such as global motion trajectories. Furthermore, given sparse spatial constraints (e.g., sparse keyframes), we introduce a new dense guidance approach that turns a sparse signal, which is susceptible to being ignored during the reverse steps, into denser signals that guide the generated motion toward the given constraints. Our extensive experiments justify the development of GMD, which achieves a significant improvement over state-of-the-art methods in text-based motion generation while allowing control of the synthesized motions with spatial constraints.
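The dense-guidance idea can be illustrated with the chain rule: if the goal is evaluated on the denoiser's clean prediction $\hat{x}_0 = D(x_t)$ but differentiated with respect to the noisy input $x_t$, the gradient becomes $J_D^\top \nabla_{\hat{x}_0} G$, which spreads to every frame that $D$ mixes. The sketch below rests on a loud assumption: a fixed Gaussian temporal smoothing matrix stands in for the learned denoiser, and `smoothing_matrix` and `sparse_and_dense_grads` are hypothetical names. It illustrates the mechanism only, not the paper's exact procedure.

```python
import numpy as np

def smoothing_matrix(num_frames, bandwidth=5.0):
    """Toy stand-in for a learned denoiser: a row-normalised Gaussian temporal
    filter, so every output frame mixes information from nearby input frames."""
    t = np.arange(num_frames)
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / bandwidth) ** 2)
    return w / w.sum(axis=1, keepdims=True)

def sparse_and_dense_grads(x_t, targets, key_idx, W):
    """Gradient of G(x0) = sum_k ||x0[k] - target_k||^2 w.r.t. the clean
    prediction x0 = W @ x_t (sparse) and w.r.t. x_t itself (dense)."""
    x0_hat = W @ x_t
    grad_x0 = np.zeros_like(x0_hat)
    grad_x0[key_idx] = 2.0 * (x0_hat[key_idx] - targets)
    grad_xt = W.T @ grad_x0   # vector-Jacobian product of the linear "denoiser"
    return grad_x0, grad_xt

T = 60
rng = np.random.default_rng(0)
x_t = rng.normal(size=(T, 2))
key_idx = np.array([10, 45])
targets = np.array([[1.0, 1.0], [-1.0, 2.0]])
W = smoothing_matrix(T)

g_sparse, g_dense = sparse_and_dense_grads(x_t, targets, key_idx, W)
frames_sparse = np.count_nonzero(np.abs(g_sparse).sum(axis=1))
frames_dense = np.count_nonzero(np.abs(g_dense).sum(axis=1) > 1e-6)
print(frames_sparse, frames_dense)  # number of frames receiving a guidance signal
```

With a real network, the vector-Jacobian product would normally come from automatic differentiation rather than an explicit `W.T`.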

Paper Structure (27 sections, 17 equations, 10 figures, 5 tables, 1 algorithm)

Figures (10)

  • Figure 1: Our proposed Guided Motion Diffusion (GMD) can generate high-quality and diverse motions given a text prompt and a goal function. We demonstrate the controllability of GMD on four different tasks, guided by the following conditions: (a) text only, (b) text and trajectory, (c) text and keyframe locations (double circles), and (d) with obstacle avoidance (red-cross areas represent obstacles). Darker colors indicate later frames.
  • Figure 2: We tackle the problem of spatially conditioned motion generation with GMD, depicted in (a). Our main contributions are (b) Emphasis projection, for better trajectory-motion coherence, and (c) Dense signal propagation, for more controllable generation even under sparse guidance signals.
  • Figure 3: (a) Under the standard motion representation and guidance method, only a few values in the motion representation are updated according to the guidance. (b) With Emphasis projection, all values describing the motion in each frame receive gradients w.r.t. the guidance, leading to better coherence between global orientation and local pose in each frame. (c) With dense gradient propagation, all frames are updated according to the guidance at the keyframes, making the guidance less likely to be ignored.
  • Figure 4: Evolution of the clean trajectory under classifier guidance for ${\mathbf{x}_0}$ and $\epsilon$ DPMs. The ${\mathbf{x}_0}$ DPM shows significant resistance to the guidance signal, as exhibited by the trajectory "contraction" behavior as $t \rightarrow 0$.
  • Figure 5: Generated motion conditioned on a given trajectory and the text "walking forward". MDM (Tevet2022-ih) exhibits motion incoherence: the model disregards the trajectory and generates an inconsistent motion. Our method, improved by emphasis projection, handles the conditioning effectively.
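Emphasis projection, as described in the Figure 3 caption, can be sketched as a per-frame linear change of representation. This minimal version assumes a random orthogonal mixing matrix (drawn via a QR decomposition), an emphasis factor of 10 applied to three hypothetical trajectory feature indices, and 8-dimensional features; the function names are hypothetical, and the paper's actual matrix, dimensions, and weighting may differ. It checks that the projection is invertible and that a gradient touching only trajectory features becomes dense in the projected space.

```python
import numpy as np

def make_emphasis_projection(dim, traj_dims, emphasis=10.0, seed=0):
    """Hypothetical emphasis projection: scale trajectory features by
    `emphasis`, then mix all features with a random orthogonal matrix."""
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
    e = np.ones(dim)
    e[traj_dims] = emphasis                           # per-feature emphasis weights
    return A, e

def project(x, A, e):
    """Map (frames, dim) motion features to the projected space, frame by frame."""
    return (x * e) @ A.T

def unproject(p, A, e):
    """Inverse map: undo the orthogonal mixing, then the per-feature scaling."""
    return (p @ A) / e

dim, traj_dims = 8, [0, 1, 2]
A, e = make_emphasis_projection(dim, traj_dims)
x = np.random.default_rng(1).normal(size=(4, dim))
p = project(x, A, e)
print(np.allclose(unproject(p, A, e), x))  # True: the projection is invertible

# A goal that reads only trajectory features has a sparse gradient g_x.
# Pulled back through x = unproject(p), it becomes dG/dp = (g_x / e) @ A.T,
# which is dense because every projected feature mixes all original features.
g_x = np.zeros_like(x)
g_x[:, traj_dims] = 1.0
g_p = (g_x / e) @ A.T
print(np.count_nonzero(g_x[0]), np.count_nonzero(np.abs(g_p[0]) > 1e-12))
```

The orthogonal matrix keeps the inverse cheap and numerically exact (its transpose), which is one convenient design choice for a sketch; any invertible dense matrix would also spread the gradient across all projected features.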