From Diffusion To Flow: Efficient Motion Generation In MotionGPT3

Jaymin Ban, JiHong Jeon, SangYeop Jeong

Abstract

Recent text-driven motion generation methods span both discrete token-based approaches and continuous-latent formulations. MotionGPT3 exemplifies the latter paradigm, combining a learned continuous motion latent space with a diffusion-based prior for text-conditioned synthesis. While rectified flow objectives have recently demonstrated favorable convergence and inference-time properties relative to diffusion in image and audio generation, it remains unclear whether these advantages transfer cleanly to the motion generation setting. In this work, we conduct a controlled empirical study comparing diffusion and rectified flow objectives within the MotionGPT3 framework. By holding the model architecture, training protocol, and evaluation setup fixed, we isolate the effect of the generative objective on training dynamics, final performance, and inference efficiency. Experiments on the HumanML3D dataset show that rectified flow converges in fewer training epochs, reaches strong test performance earlier, and matches or exceeds diffusion-based motion quality under identical conditions. Moreover, flow-based priors exhibit stable behavior across a wide range of inference step counts and achieve competitive quality with fewer sampling steps, yielding improved efficiency–quality trade-offs. Overall, our results suggest that several known benefits of rectified flow objectives do extend to continuous-latent text-to-motion generation, highlighting the importance of the training objective choice in motion priors.
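The rectified flow objective contrasted with diffusion above can be summarized in a few lines: the model regresses the constant velocity of a straight-line path between a clean latent and Gaussian noise, and sampling integrates that velocity field with a simple ODE solver. The sketch below is a minimal NumPy illustration of this idea under assumed shapes and a generic `velocity_model` callable; it is not the MotionGPT3 implementation.

```python
import numpy as np

def rectified_flow_loss(velocity_model, x0, rng):
    """Rectified-flow training loss (illustrative sketch).

    x0: batch of clean motion latents, shape (B, D).
    velocity_model(xt, t): predicts the velocity field at interpolant xt, time t.
    """
    x1 = rng.standard_normal(x0.shape)        # Gaussian noise endpoint
    t = rng.uniform(size=(x0.shape[0], 1))    # per-sample time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1              # straight-line interpolation
    v_target = x1 - x0                        # constant target velocity along the path
    v_pred = velocity_model(xt, t)
    return np.mean((v_pred - v_target) ** 2)  # simple MSE regression

def euler_sample(velocity_model, noise, num_steps):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) back to t = 0 (data)."""
    x = noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_model(x, t)
    return x
```

Because the target velocity is constant along each path, the learned field is comparatively straight, which is the intuition behind the stability at low step counts reported in the abstract: a few Euler steps already track the trajectory well.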


Figures (4)

  • Figure 1: Overview of the MotionGPT3 architecture with alternative motion priors. The dual-stream GPT-2 backbone produces text-conditioned motion latents, which are fed into either a diffusion-based or flow-based motion prior. Frozen and trainable modules are marked with the corresponding symbols.
  • Figure 2: Motion-prior-only inference Pareto comparison between diffusion and flow variants, illustrating the trade-off between inference time and generation quality.
  • Figure 3: End-to-end Pareto comparison between diffusion and flow variants, illustrating the trade-off between end-to-end inference time and generation quality.
  • Figure 4: Validation metrics across training epochs for diffusion- and flow-based variants. We report FID, Matching Score, and R-Precision (R@3). Faded curves indicate raw validation measurements, while solid curves show exponential moving averages computed with a span of five epochs to highlight overall training trends.