FlowMotion: Target-Predictive Conditional Flow Matching for Jitter-Reduced Text-Driven Human Motion Generation

Manolo Canales Cuba, Vinícius do Carmo Melício, João Paulo Gois

TL;DR

FlowMotion tackles jitter and fidelity in text-driven 3D human motion generation under resource constraints by introducing a target-motion prediction objective within Conditional Flow Matching: the model directly predicts the target motion $x_1$ from intermediate states $x_t$. It combines a Transformer-CLIP conditioning stack with Euler-based sampling and classifier-free guidance to generate diverse, text-aligned motions quickly. Empirical results on HumanML3D and KIT show that FlowMotion achieves state-of-the-art jitter on KIT and competitive FID, with a favorable Mahalanobis FID-J balance and substantial speed advantages over diffusion-based methods. The approach offers a practical, scalable route to real-time, high-fidelity motion generation in constrained environments and lays groundwork for future improvements in text fidelity and interactive motion editing.
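The sampling procedure summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a linear path $x_t = (1 - t)\,x_0 + t\,x_1$, under which a predicted target $\hat{x}_1$ implies the velocity $(\hat{x}_1 - x_t)/(1 - t)$. The names `predict_x1`, `guided_x1`, `euler_sample`, and the default `guidance_scale` are hypothetical, and applying classifier-free guidance to the predicted target is one possible choice among several.

```python
import numpy as np

def guided_x1(predict_x1, x_t, t, cond, guidance_scale):
    """Classifier-free guidance on the predicted target (assumed formulation)."""
    x1_cond = predict_x1(x_t, t, cond)    # text-conditioned prediction
    x1_uncond = predict_x1(x_t, t, None)  # unconditional prediction
    return x1_uncond + guidance_scale * (x1_cond - x1_uncond)

def euler_sample(predict_x1, x0, cond, num_steps=10, guidance_scale=2.5):
    """Integrate from noise x0 (t = 0) to a motion sample (t = 1) with Euler steps.

    predict_x1 is a hypothetical callable (x_t, t, cond) -> predicted x_1.
    """
    x_t = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so (1 - t) is never zero
        x1_hat = guided_x1(predict_x1, x_t, t, cond, guidance_scale)
        # Velocity implied by the predicted target on the assumed linear path.
        velocity = (x1_hat - x_t) / (1.0 - t)
        x_t = x_t + dt * velocity
    return x_t
```

With a perfect predictor that always returns the true target, the final Euler step lands on that target, which is a convenient sanity check for the update rule.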

Abstract

Achieving high-fidelity and temporally smooth 3D human motion generation remains a challenge, particularly within resource-constrained environments. We introduce FlowMotion, a novel method leveraging Conditional Flow Matching (CFM). FlowMotion incorporates a training objective within CFM that focuses on more accurately predicting the target motion, resulting in enhanced generation fidelity and temporal smoothness while maintaining the fast synthesis times characteristic of flow-matching-based methods. FlowMotion achieves state-of-the-art jitter performance, with the best jitter on the KIT dataset and the second-best jitter on the HumanML3D dataset, along with a competitive FID on both datasets. This combination provides robust and natural motion sequences, offering a promising equilibrium between generation quality and temporal naturalness.

Paper Structure

This paper contains 26 sections, 17 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: FID-Jitter evaluation on the HumanML3D and KIT datasets. Jitter is reported as the absolute difference from the ground-truth jitter. HumanML3D is shown in light blue and KIT in orange; stars of the same color mark our results (FlowMotion). FlowMotion's proximity to the ground truth (GT), shown in yellow at the origin, highlights its balance between generation fidelity, measured by Fréchet Inception Distance (FID), and motion smoothness, quantified by lower jitter values, indicating generation closer to real-world motion than existing methods.
  • Figure 2: Text-Driven Motion Generation: Leveraging the architecture proposed by Tevet et al. [tevet2022human], our approach generates motion sequences from an input ${x}_t = ({x}^1_t, {x}^2_t, \dots, {x}^N_t)$, where each ${x}^i_t \in \mathbb{R}^{J \times D}$ denotes the pose of the $i$-th frame. Crucially, this input is derived via conditional flow matching. By processing this input through a Transformer Encoder, the model produces a motion sequence ${x}_1 = ({x}^1_1, {x}^2_1, \dots, {x}^N_1)$.
  • Figure 3: Overview of the training process. In each epoch, the process starts with a sample $x_1$ from the training dataset and a sample $x_0 \sim \mathcal{N}(0, {I})$. An intermediate representation $x_t$ is then obtained by linear interpolation between $x_1$ and $x_0$. The region highlighted in red denotes the space of valid human motions. Note that at the beginning of the process, $x_t$ may lie outside this space.
  • Figure 4: Comparison of trajectory generation performance between MDM [tevet2022human], MFM [hu2023motion], and the FlowMotion Model for circular motion instructions. MDM and MFM exhibit significant errors in trajectory closure. The FlowMotion Model, conversely, generates a trajectory whose start and end points coincide accurately, indicative of superior motion instruction understanding. The FlowMotion Model's improved FID score confirms its enhanced generation fidelity to the intended circular motion.
  • Figure 5: Qualitative comparison of motion sequences generated from the text prompt "a person walks forward in a straight line", comparing MFM [hu2023motion], T2M-GPT [zhang2023generating], and FlowMotion. The figure displays multiple poses of the character, spaced in time, with wrist and heel trajectories shown to visualize the motion path.
  • ...and 3 more figures
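
The training setup described in the Figure 3 caption can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's code. The interpolation convention $x_t = (1 - t)\,x_0 + t\,x_1$, the uniform sampling of $t$, and the mean-squared-error form of the target-prediction loss are all assumptions, and `interpolate`, `target_prediction_loss`, and `predict_x1` are hypothetical names. The motion tensor is assumed to have shape `(N, J, D)`, following the notation in the Figure 2 caption.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Linear path between noise x0 (t = 0) and data x1 (t = 1); assumed convention."""
    return (1.0 - t) * x0 + t * x1

def target_prediction_loss(predict_x1, x1, rng):
    """One training-step loss for a target-predicting model (assumed MSE form).

    predict_x1 is a hypothetical callable (x_t, t) -> predicted x_1.
    x1 is a motion sample of shape (N, J, D): frames, joints, features.
    """
    x0 = rng.standard_normal(x1.shape)  # x0 ~ N(0, I)
    t = rng.uniform(0.0, 1.0)           # assumed uniform time sampling
    x_t = interpolate(x0, x1, t)        # intermediate state, possibly not a valid motion
    x1_hat = predict_x1(x_t, t)         # model predicts the target motion directly
    return float(np.mean((x1_hat - x1) ** 2))
```

For small $t$ the interpolated state is dominated by Gaussian noise, which matches the caption's remark that $x_t$ can fall outside the space of valid human motions early in the process.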