Human Motion Synthesis: A Diffusion Approach for Motion Stitching and In-Betweening

Michael Adewole, Oluwaseyi Giwa, Favour Nerrise, Martins Osifeko, Ajibola Oyedeji

TL;DR

This work tackles motion stitching and in-betweening by framing human motion synthesis as a diffusion process guided by a transformer-based denoiser. The model encodes contextual input poses with a transformer, then iteratively denoises Gaussian-noised frames to generate smooth, realistic motion, reconstructing joint positions via forward kinematics. A composite loss with five components guides training, and evaluation uses FID, Diversity, and Multimodality on AMASS-derived datasets, demonstrating strong in-betweening capabilities. Key limitations include a fixed output length and reduced performance with very small input context, prompting future work to incorporate richer conditioning such as textual descriptions for longer, context-driven sequences.

Abstract

Human motion generation is an important area of research in many fields. In this work, we tackle the problem of motion stitching and in-betweening. Current methods either require manual effort or cannot handle longer sequences. To address these challenges, we propose a diffusion model with a transformer-based denoiser to generate realistic human motion. Our method demonstrates strong performance in generating in-betweening sequences, transforming a variable number of input poses into smooth and realistic motion sequences of 75 frames at 15 fps, for a total duration of 5 seconds. We evaluate our method using quantitative metrics such as Fréchet Inception Distance (FID), Diversity, and Multimodality, along with visual assessments of the generated outputs.

Paper Structure (18 sections, 6 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: The workflow of our approach. Contextual information is extracted from the input poses $c$ using a transformer encoder. The output is used to transform noisy motion data $x_t$ to clean motion $\hat{x}_0^t$ using another transformer encoder. The clean motion goes through the noise schedule to generate the noisy data $x_{t-1}$ for the next timestep. This process is repeated for a predefined number of iterations.
  • Figure 2: Sample output from our model on unseen input motion. The red body indicates the input poses. The sequence has been downsampled for clearer visualization, and the ratio of input poses to generated output is maintained.
  • Figure 3: Sample output from our model on unseen input motion. The sequence has been downsampled for clearer visualization. Each frame has been sufficiently spaced to prevent overlapping.
  • Figure 4: Sample output from our model on unseen input motion. Each frame has been sufficiently spaced to prevent overlapping.
  • Figure 5: Sample output from our model on unseen input motion. Each frame has been sufficiently spaced to prevent overlapping. Root position and orientation are not visualized.
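
The sampling loop described in the Figure 1 caption (predict clean motion $\hat{x}_0^t$ from noisy motion $x_t$ and the context $c$, then re-noise it to obtain $x_{t-1}$) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the linear noise schedule, the step count, the feature dimension of 6, and the names `make_schedule`, `sample`, and `toy_denoiser` are all assumptions introduced here. The stand-in denoiser simply returns the mean context pose for every frame, where the paper uses a transformer.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    # Assumed linear beta schedule; the source does not state which schedule is used.
    betas = np.linspace(beta_start, beta_end, num_steps)
    # Cumulative product of (1 - beta): fraction of signal retained at each step.
    return np.cumprod(1.0 - betas)

def sample(denoiser, context, shape, num_steps=50, seed=0):
    """Iteratively denoise Gaussian noise into a motion sequence.

    denoiser(x_t, t, context) must return an estimate of the clean motion x_0.
    """
    rng = np.random.default_rng(seed)
    alphas_cumprod = make_schedule(num_steps)
    x_t = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in range(num_steps - 1, -1, -1):
        x0_hat = np.array(denoiser(x_t, t, context))  # predicted clean motion
        if t == 0:
            x_t = x0_hat  # last step: keep the clean prediction
        else:
            # Re-noise the clean prediction to the noise level of step t-1.
            a_prev = alphas_cumprod[t - 1]
            noise = rng.standard_normal(shape)
            x_t = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * noise
    return x_t

def toy_denoiser(x_t, t, context):
    # Hypothetical stand-in for the transformer denoiser:
    # repeats the mean context pose for every output frame.
    return np.broadcast_to(context.mean(axis=0), x_t.shape)

if __name__ == "__main__":
    frames, fps, feat_dim = 75, 15, 6   # 75 frames at 15 fps, per the abstract
    context = np.ones((4, feat_dim))    # four made-up input poses
    motion = sample(toy_denoiser, context, (frames, feat_dim), num_steps=10)
    print(motion.shape, frames / fps, "seconds")
```

Predicting the clean motion at every step, rather than the noise, means the output of each iteration is itself a motion sequence, which is what lets losses on joint positions be applied to it directly.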