Length-Aware Motion Synthesis via Latent Diffusion
Alessio Sampieri, Alessio Palma, Indro Spinelli, Fabio Galasso
TL;DR
This work tackles length-aware, text-conditioned 3D human motion synthesis by introducing Length-Aware Latent Diffusion (LADiff), which jointly learns a length-aware VAE and a latent diffusion process that adapts to the target sequence duration. The latent space is organized into subspaces that activate progressively as the target length increases, and diffusion operates over the resulting variable-dimensional latent representation, aided by a Denoising VAE (DVAE) for robustness. Results on HumanML3D and KIT-ML show state-of-the-art performance on most of the existing motion synthesis metrics, and ablations support the contribution of the length-aware subspaces, the DVAE augmentation, and the overall architecture. The approach gives control over motion duration while preserving textual fidelity and motion realism, which is relevant to applications such as animation and robotics.
Abstract
The target duration of a synthesized human motion is a critical attribute, and controlling it requires modeling the motion dynamics and style. Performing an action faster is not merely fast-forwarding it. However, state-of-the-art techniques for human behavior synthesis have limited control over the target sequence length. We introduce the problem of generating length-aware 3D human motion sequences from textual descriptors, and we propose a novel model to synthesize motions of variable target lengths, which we dub "Length-Aware Latent Diffusion" (LADiff). LADiff consists of two new modules: 1) a length-aware variational auto-encoder to learn motion representations with length-dependent latent codes; 2) a length-conforming latent diffusion model to generate motions with a richness of details that increases with the required target sequence length. LADiff significantly improves over the state-of-the-art across most of the existing motion synthesis metrics on the two established benchmarks of HumanML3D and KIT-ML.
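To make the idea of length-dependent latent codes concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that the latent space is split into equally sized subspaces and that target lengths are mapped to subspaces through equal-width length buckets; the function names (num_active_subspaces, length_mask, sample_length_aware_noise) and all sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def num_active_subspaces(length, max_length, num_subspaces):
    """Map a target length (in frames) to the number of active latent subspaces.

    Assumption for illustration: equal-width length buckets, so longer
    sequences switch on more subspaces.
    """
    if not 1 <= length <= max_length:
        raise ValueError("length must be in [1, max_length]")
    # Integer ceiling of length * num_subspaces / max_length.
    return (length * num_subspaces + max_length - 1) // max_length


def length_mask(length, max_length, num_subspaces, subspace_dim):
    """Boolean mask of shape (num_subspaces, subspace_dim); True marks active entries."""
    k = num_active_subspaces(length, max_length, num_subspaces)
    mask = np.zeros((num_subspaces, subspace_dim), dtype=bool)
    mask[:k] = True  # the first k subspaces are active
    return mask


def sample_length_aware_noise(length, max_length, num_subspaces, subspace_dim, rng):
    """Draw Gaussian noise only in the active subspaces; inactive ones stay at zero.

    This mimics the starting point of a diffusion process over a latent
    whose effective dimensionality depends on the target length.
    """
    mask = length_mask(length, max_length, num_subspaces, subspace_dim)
    noise = rng.standard_normal(mask.shape)
    return np.where(mask, noise, 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for frames in (40, 41, 200):
        z = sample_length_aware_noise(frames, max_length=200,
                                      num_subspaces=5, subspace_dim=4, rng=rng)
        active_rows = int(np.any(z != 0.0, axis=1).sum())
        print(frames, "frames ->", active_rows, "active subspaces")
```

Under these assumptions, a short target length initializes only the first subspace, while the maximum length uses all of them, so the latent code available to the decoder grows with the requested duration.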
