SPG: Improving Motion Diffusion by Smooth Perturbation Guidance

Boseong Jeon

TL;DR

SPG tackles the problem of improving motion diffusion outputs without retraining by introducing test-time, model-agnostic weak-model guidance. It builds an aligned weak term through temporal smoothing of the predicted motion, integrated into the denoising process to enhance fidelity while preserving motion structure. Across diverse architectures and tasks, SPG achieves state-of-the-art fidelity, often outperforming CFG in isolation and complementing CFG when combined. The method is simple to implement, requires minimal code changes, and broadens the applicability of diffusion-based motion generation with improved realism and reduced foot-skating, albeit at the cost of extra evaluation time and potential abrupt transitions in some cases.

Abstract

This paper presents a test-time guidance method that improves the output quality of human motion diffusion models without additional training. To obtain negative guidance, Smooth Perturbation Guidance (SPG) builds a weak model by temporally smoothing the motion during the denoising steps. Compared to model-agnostic methods originating from the image generation field, SPG effectively mitigates out-of-distribution issues when perturbing motion diffusion models. Under SPG, the structure of the motion remains intact. This work conducts a comprehensive analysis across distinct model architectures and tasks. Despite its extremely simple implementation and no need for additional training, SPG consistently enhances motion fidelity. The project page can be found at https://spg-blind.vercel.app/
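The idea in the abstract (smooth the predicted motion over time, use the result as a weak term, and guide away from it) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `temporal_smooth` and `spg_guided_x0`, the box (moving-average) kernel, the edge padding, the re-noising step, and the guidance form `strong + s * (strong - weak)` are all assumptions chosen for illustration. The denoiser `predict_x0` is a hypothetical callable that maps a noisy motion of shape `(T, D)` to a clean-motion estimate.

```python
import numpy as np


def temporal_smooth(motion, k):
    """Moving average along the time axis (axis 0) with an odd window k.

    Edge padding keeps the number of frames unchanged. The box kernel is an
    assumption; the paper's smoothing kernel may differ.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    pad = k // 2
    padded = np.pad(motion, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(k) / k
    columns = [
        np.convolve(padded[:, d], kernel, mode="valid")
        for d in range(motion.shape[1])
    ]
    return np.stack(columns, axis=1)


def spg_guided_x0(predict_x0, x_t, alpha_bar, s=1.0, k=3):
    """One guided prediction, sketched under the assumptions stated above.

    predict_x0: hypothetical denoiser, (T, D) noisy motion -> (T, D) estimate.
    alpha_bar:  cumulative noise-schedule product at this step, in (0, 1).
    s:          guidance scale; s = 0 recovers the unguided prediction.
    k:          temporal smoothing window (odd).
    """
    x0_strong = predict_x0(x_t)
    # Noise implied by the strong prediction at this step.
    eps_hat = (x_t - np.sqrt(alpha_bar) * x0_strong) / np.sqrt(1.0 - alpha_bar)
    # Weak input: smooth the predicted motion, then re-noise it with eps_hat.
    x0_smooth = temporal_smooth(x0_strong, k)
    x_t_weak = np.sqrt(alpha_bar) * x0_smooth + np.sqrt(1.0 - alpha_bar) * eps_hat
    # Second network evaluation: this is the extra test-time cost.
    x0_weak = predict_x0(x_t_weak)
    # Push the prediction away from the weak (smoothed) one.
    return x0_strong + s * (x0_strong - x0_weak)
```

The second call to `predict_x0` reflects the extra evaluation time noted in the TL;DR. Combining this with CFG would need a further conditional/unconditional pass, which is left out of the sketch.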

Paper Structure

This paper contains 23 sections, 12 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Red: baseline (CFG or no guidance), Green: only SPG applied. With fewer than 10 lines of code changes during test-time inference, SPG improves the realism and reduces foot-skating in motion diffusion models. SPG is a model-agnostic weak-model guidance approach applicable to various networks and tasks.
  • Figure 2: An estimate of how the sampling process deviates from the state manifold [huang2024constraineddiffusiontrustsampling]. A lower value of $\|\epsilon(\mathbf{x}_t, c)\|$ indicates better confinement within the state manifold. Batch size was set to $10$.
  • Figure 3: Performance for various SPG scales $s$ (in the SPG guidance equation) and kernel sizes $k$ (in the smoothing convolution equation). Evaluations were conducted on the HumanML3D test set using the T2M model of MDM. (Top) without CFG, (Bottom) with CFG. The black dashed line denotes the baseline without any weak-model guidance. Colored dashed lines were obtained from the original SAG [sag] implementation on the deterministic noise in the "diffusing from smooth" equation, while solid lines correspond to SPG. With a proper kernel size $k \leq 7$ and scale $s \geq 0.2$, SPG achieved better results than the baseline on most metrics. Best viewed in color.
  • Figure 4: Comparison of the weak term $g_{\theta}(\mathbf{x}_{t}, c)$ at an early denoising step. (Left) The ICG perturbation caused the loss of the semantic meaning of dancing, resulting in a simple walking motion at the final denoising step. In contrast, the weak term of SPG degraded the motion slightly while keeping its content, and SPG obtained a more plausible final motion. (Right) The ICG perturbation leads to unstable sliding of the body, while SPG keeps the motion state in-manifold because it applies temporal smoothing. See the colored arrows.
  • Figure 5: Speed (top) and acceleration (bottom) comparison for guidance methods. The joint indexing is based on the 22 SMPL joints of HumanML3D.
  • ...and 1 more figures
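
Per-joint speed and acceleration curves like those described for Figure 5 can be estimated from joint positions with finite differences. The sketch below is only an illustration of that kind of measurement, not the paper's evaluation code: the function name `per_joint_speed_and_acceleration`, the `(T, J, 3)` position layout, and the default frame rate of 20 fps are assumptions.

```python
import numpy as np


def per_joint_speed_and_acceleration(joints, fps=20.0):
    """Mean speed and mean acceleration magnitude for each joint.

    joints: array of shape (T, J, 3) with 3D joint positions per frame
            (J = 22 for the SMPL joint set mentioned in the caption).
    fps:    frame rate used to convert per-frame differences to per-second
            units (assumed value; adjust to the dataset at hand).
    Returns two arrays of shape (J,).
    """
    joints = np.asarray(joints, dtype=float)
    if joints.ndim != 3 or joints.shape[0] < 3:
        raise ValueError("expected shape (T, J, 3) with at least 3 frames")
    # First difference approximates velocity, second difference acceleration.
    velocity = np.diff(joints, n=1, axis=0) * fps
    acceleration = np.diff(joints, n=2, axis=0) * fps ** 2
    speed = np.linalg.norm(velocity, axis=-1).mean(axis=0)
    accel = np.linalg.norm(acceleration, axis=-1).mean(axis=0)
    return speed, accel
```

For a motion that translates at constant velocity, the speed is the same for every joint and the acceleration is zero, which is a quick sanity check before comparing guidance methods.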