FlowMotion: Target-Predictive Conditional Flow Matching for Jitter-Reduced Text-Driven Human Motion Generation
Manolo Canales Cuba, Vinícius do Carmo Melício, João Paulo Gois
TL;DR
FlowMotion tackles jitter and fidelity in text-driven 3D human motion generation under resource constraints by introducing a target-motion prediction objective within Conditional Flow Matching (CFM). The model directly predicts the target motion $x_1$ from intermediate states $x_t$, is conditioned through a Transformer-CLIP stack, and samples with Euler integration and classifier-free guidance to produce diverse, text-aligned motions quickly. On HumanML3D and KIT, FlowMotion achieves state-of-the-art jitter on KIT and competitive FID, with a favorable Mahalanobis FID-J balance and substantial speed advantages over diffusion-based methods. The approach offers a practical, scalable route to real-time, high-fidelity motion generation in constrained environments and lays groundwork for future improvements in text fidelity and interactive motion editing.
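As a rough illustration of what predicting $x_1$ from $x_t$ can look like, the sketch below writes one possible target-prediction objective. It assumes the linear interpolation path commonly used in CFM, a Gaussian source sample $x_0$, and a network $f_\theta$ conditioned on a text embedding $c$; these symbols and the exact form of the loss are assumptions for illustration, not necessarily the paper's formulation.

$$x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]$$

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1,\,c}\,\big\| f_\theta(x_t, t, c) - x_1 \big\|^2$$

Under this assumed path, $x_1 - x_t = (1 - t)(x_1 - x_0)$, so a predicted target implies a velocity estimate $\hat{v} = \big(f_\theta(x_t, t, c) - x_t\big) / (1 - t)$ that an ODE solver can integrate.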
Abstract
Achieving high-fidelity and temporally smooth 3D human motion generation remains a challenge, particularly within resource-constrained environments. We introduce FlowMotion, a novel method leveraging Conditional Flow Matching (CFM). FlowMotion incorporates a training objective within CFM that focuses on predicting the target motion more accurately, which improves generation fidelity and temporal smoothness while maintaining the fast synthesis times characteristic of flow-matching-based methods. FlowMotion achieves state-of-the-art jitter performance, with the best jitter on the KIT dataset and the second-best on the HumanML3D dataset, along with competitive FID values on both datasets. This combination yields robust and natural motion sequences, offering a promising balance between generation quality and temporal naturalness.
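To make the idea of a target-predicting flow-matching model more concrete, here is a minimal sketch of a target-prediction loss and an Euler sampler with classifier-free guidance. It is not the authors' implementation: it assumes a linear interpolation path between noise and target, uses plain Python lists in place of motion tensors, and every name (`predict_x1`, `target_prediction_loss`, `euler_sample`, the guidance weight `w`, the step count) is hypothetical.

```python
# Illustrative sketch only. Assumes the linear path x_t = (1 - t) * x0 + t * x1;
# the paper's actual path, network, and hyperparameters may differ.

def interpolate(x0, x1, t):
    """Point on the assumed linear path between noise x0 and target x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_prediction_loss(predict_x1, x0, x1, t, cond):
    """Mean squared error between the predicted target and the true x1."""
    xt = interpolate(x0, x1, t)
    pred = predict_x1(xt, t, cond)
    return sum((p - b) ** 2 for p, b in zip(pred, x1)) / len(x1)

def guided_x1(predict_x1, xt, t, cond, w):
    """Classifier-free guidance: blend unconditional and conditional predictions.
    cond=None stands in for the dropped (null) text condition."""
    uncond = predict_x1(xt, t, None)
    conditional = predict_x1(xt, t, cond)
    return [u + w * (c - u) for u, c in zip(uncond, conditional)]

def euler_sample(predict_x1, x0, cond, steps=10, w=2.5):
    """Integrate from t=0 to t=1 with fixed-step Euler.
    steps and w are arbitrary illustrative defaults, not values from the paper."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so (1 - t) is never zero
        x1_hat = guided_x1(predict_x1, x, t, cond, w)
        # On the linear path, the velocity implied by a target estimate is
        # (x1_hat - x_t) / (1 - t).
        v = [(h - xi) / (1.0 - t) for h, xi in zip(x1_hat, x)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In practice `predict_x1` would be a trained network taking a noisy motion sequence, a time value, and a text embedding. With fixed-step Euler on this path, the final step lands exactly on the last guided target estimate, which is one intuition for why target prediction pairs naturally with few-step sampling.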
