Shape Conditioned Human Motion Generation with Diffusion Model

Kebing Xue, Hyewon Seo

TL;DR

This work tackles mesh-level human motion generation conditioned on a target body shape. It introduces Shape-conditioned Motion Diffusion (SMD), which represents meshes in the spectral domain via the graph Laplacian and denoises with a Spectral-Temporal Autoencoder (STAE) within a diffusion framework. SMD conditions generation on a target mesh together with either a natural-language description or an action class, achieving competitive text-to-motion and action-to-motion performance while maintaining high shape fidelity, as demonstrated on AMASS-derived datasets. The approach reduces the cost of working with full meshes through spectral compression, improves physics-based metrics thanks to direct mesh conditioning, and offers a practical path toward streamlined, shape-consistent character animation and data augmentation.

Abstract

Human motion synthesis is an important task in computer graphics and computer vision. While existing methods explore various conditioning signals such as text, action class, or audio to guide the generation process, most of them rely on skeleton-based pose representations, which require additional skinning to produce renderable meshes. Given that human motion is a complex interplay of bones, joints, and muscles, considering solely the skeleton for generation may neglect their inherent interdependency, which can limit the variability and precision of the generated results. To address this issue, we propose a Shape-conditioned Motion Diffusion model (SMD), which enables the generation of motion sequences directly in mesh format, conditioned on a specified target mesh. In SMD, the input meshes are transformed into spectral coefficients using the graph Laplacian, which yields an efficient mesh representation. We then propose a Spectral-Temporal Autoencoder (STAE) to leverage cross-temporal dependencies within the spectral domain. Extensive experimental evaluations show that SMD not only produces vivid and realistic motions but also achieves competitive performance in text-to-motion and action-to-motion tasks when compared to state-of-the-art methods.
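
For readers unfamiliar with the spectral representation mentioned in the abstract, the standard graph Fourier transform of a mesh can be written as follows. This is a generic sketch, not necessarily the exact formulation used in the paper: the combinatorial Laplacian $L = D - A$ and the symbols $V$, $U_k$, and $\hat{V}$ are assumptions introduced here for illustration.

$$
L = D - A = U \Lambda U^{\top}, \qquad
\hat{V} = U_k^{\top} V, \qquad
V \approx U_k \hat{V}
$$

Here $V \in \mathbb{R}^{N \times 3}$ stacks the vertex coordinates of a mesh with $N$ vertices, $A$ and $D$ are the adjacency and degree matrices of the mesh graph, and $U_k \in \mathbb{R}^{N \times k}$ holds the $k$ eigenvectors of $L$ with the smallest eigenvalues. With $k < N$ the coefficients $\hat{V} \in \mathbb{R}^{k \times 3}$ form a compressed representation; with $k = N$ the reconstruction is exact because $U$ is orthogonal.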

Paper Structure (14 sections, 20 equations, 5 figures, 3 tables, 1 algorithm)

Figures (5)

  • Figure 1: Shape-conditioned generation results of SMD. $A_1$, $A_2$, and $A_3$ are generated with $A$ as the target mesh; $B_1$, $B_2$, and $B_3$ are generated with $B$ as the target mesh.
  • Figure 2: Method overview. 1) For each mesh in the motion data, the vertex coordinates in the local frame are transformed into spectral coefficients by a graph Fourier transform; 2) the coefficients, together with the rotations and translations, are used to train the diffusion model; 3) after training, the Spectral-Temporal Autoencoder (STAE) generates motion from random noise by iteratively denoising it, conditioned on a conditioning signal $z_d$ and a target mesh embedding $z_s$.
  • Figure 3: Reconstruction error as a function of the number of eigenvectors used. SMPL meshes are compressed by applying the graph Fourier transform and its inverse with a limited number of eigenvectors; from left to right, 128, 512, 1024, 2048, and 6890 eigenvectors are used. The color encodes the distance between the compressed mesh and the original mesh.
  • Figure 4: Overview of our Spectral-Temporal Autoencoder (STAE).
  • Figure 5: Average errors in shape consistency within the motion (left) and in comparison with the target (right).
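
The compression experiment described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a combinatorial graph Laplacian built from triangle connectivity, a toy octahedron in place of an SMPL mesh, and hypothetical helper names (`graph_laplacian`, `spectral_basis`, `compress`, `mean_vertex_error`).

```python
import numpy as np

def graph_laplacian(num_vertices, faces):
    """Combinatorial Laplacian L = D - A from triangle connectivity (assumed form)."""
    A = np.zeros((num_vertices, num_vertices))
    for a, b, c in faces:
        for i, j in ((a, b), (b, c), (c, a)):
            A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def spectral_basis(L):
    """Eigenvectors of L as columns, ordered from low to high frequency."""
    _, U = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
    return U

def compress(vertices, U, k):
    """Project onto the first k eigenvectors, then transform back."""
    coeffs = U[:, :k].T @ vertices   # (k, 3) spectral coefficients
    return U[:, :k] @ coeffs         # (N, 3) reconstructed vertices

def mean_vertex_error(vertices, U, k):
    """Mean Euclidean distance between original and reconstructed vertices."""
    diff = vertices - compress(vertices, U, k)
    return float(np.linalg.norm(diff, axis=1).mean())

# Toy stand-in for a body mesh: an octahedron with 6 vertices and 8 faces.
verts = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                  [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
faces = [(0, 2, 4), (2, 1, 4), (1, 3, 4), (3, 0, 4),
         (2, 0, 5), (1, 2, 5), (3, 1, 5), (0, 3, 5)]

U = spectral_basis(graph_laplacian(len(verts), faces))
for k in (1, 4, 6):
    print(k, mean_vertex_error(verts, U, k))
```

Using all eigenvectors reproduces the mesh up to floating-point error, which matches the rightmost case in Figure 3 (6890 eigenvectors, one per SMPL vertex). For a mesh of that size, a sparse Laplacian and a partial eigensolver would likely be preferable to the dense `np.linalg.eigh` call used here.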