Human Motion Synthesis: A Diffusion Approach for Motion Stitching and In-Betweening
Michael Adewole, Oluwaseyi Giwa, Favour Nerrise, Martins Osifeko, Ajibola Oyedeji
TL;DR
This work tackles motion stitching and in-betweening by framing human motion synthesis as a diffusion process guided by a transformer-based denoiser. The model encodes contextual input poses with a transformer, then iteratively denoises Gaussian-noised frames to generate smooth, realistic motion, reconstructing joint positions via forward kinematics. A composite loss with five components guides training, and evaluation uses FID, Diversity, and Multimodality on AMASS-derived datasets, demonstrating strong in-betweening capabilities. Key limitations include a fixed output length and reduced performance with very small input context, prompting future work to incorporate richer conditioning such as textual descriptions for longer, context-driven sequences.
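To make the noising step described above concrete, the following is a minimal NumPy sketch of the standard DDPM forward process applied to a 75-frame motion clip. It is not the authors' implementation: the linear beta schedule, the number of diffusion steps, the pose dimension of 66, and the helper names (`linear_beta_schedule`, `forward_noise`, `apply_context`) are all assumptions made for illustration. The sketch also swaps in a simpler technique for conditioning: it overwrites the known context frames with their clean values (inpainting-style), whereas the paper encodes the context poses with a transformer.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

def apply_context(x_t, x0, context_mask):
    """Overwrite known (context) frames with their clean values.

    This inpainting-style masking is a stand-in for the paper's
    transformer-encoded conditioning.
    """
    return np.where(context_mask[:, None], x0, x_t)

# 75 frames matches the paper; POSE_DIM and NUM_STEPS are hypothetical.
NUM_FRAMES, POSE_DIM, NUM_STEPS = 75, 66, 1000

rng = np.random.default_rng(0)
betas = linear_beta_schedule(NUM_STEPS)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

# Random stand-in for a clean motion clip of shape (frames, pose features).
x0 = rng.standard_normal((NUM_FRAMES, POSE_DIM))

# In-betweening setup: the first and last 10 frames are given as context.
context_mask = np.zeros(NUM_FRAMES, dtype=bool)
context_mask[:10] = True
context_mask[-10:] = True

x_t, noise = forward_noise(x0, NUM_STEPS - 1, alpha_bar, rng)
x_t = apply_context(x_t, x0, context_mask)
```

A trained denoiser would then be applied repeatedly, starting from the final step and working back to step 0, to recover the missing middle frames. For motion stitching, the same idea would apply with the context frames taken from the end of one clip and the start of the next.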
Abstract
Human motion generation is an important area of research in many fields. In this work, we tackle the problem of motion stitching and in-betweening. Current methods either require manual effort or cannot handle longer sequences. To address these challenges, we propose a diffusion model with a transformer-based denoiser to generate realistic human motion. Our method demonstrates strong performance on in-betweening, transforming a variable number of input poses into smooth and realistic motion sequences of 75 frames at 15 fps, for a total duration of 5 seconds. We evaluate our method using quantitative metrics, namely Fréchet Inception Distance (FID), Diversity, and Multimodality, along with visual assessments of the generated outputs.
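As an illustration of one of the metrics named above, the sketch below computes Diversity in the form commonly used in the motion generation literature: the mean Euclidean distance between randomly paired feature vectors of generated motions. The exact variant, feature extractor, and sample sizes used in this work are not stated here, so the function name `diversity`, the feature dimension, and the pair count are assumptions.

```python
import numpy as np

def diversity(features, num_pairs, rng):
    """Mean L2 distance between two random subsets of feature vectors.

    `features` has shape (num_samples, feature_dim). This follows a common
    definition of Diversity; it may differ from the paper's exact variant.
    """
    n = features.shape[0]
    first = rng.choice(n, size=num_pairs, replace=False)
    second = rng.choice(n, size=num_pairs, replace=False)
    distances = np.linalg.norm(features[first] - features[second], axis=1)
    return float(distances.mean())

rng = np.random.default_rng(0)

# Random stand-in for features extracted from 200 generated motions.
features = rng.standard_normal((200, 32))
score = diversity(features, num_pairs=100, rng=rng)

# Sanity check: identical samples have zero diversity.
identical = diversity(np.ones((200, 32)), num_pairs=100, rng=rng)
```

Higher values indicate that the generated motions are more spread out in feature space. Multimodality is typically computed in a similar way, but over multiple outputs generated from the same input context rather than across the whole set.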
