MotionRFT: Unified Reinforcement Fine-Tuning for Text-to-Motion Generation

Xiaofeng Tan, Wanjiang Weng, Hongsong Wang, Fang Zhao, Xin Geng, Liang Wang

Abstract

Text-to-motion generation has advanced with diffusion- and flow-based generative models, yet supervised pretraining remains insufficient to align models with high-level objectives such as semantic consistency, realism, and human preference. Existing post-training methods have key limitations: they (1) target a specific motion representation, such as joints; (2) optimize a particular aspect, such as text-motion alignment, and may compromise other factors; and (3) incur substantial computational overhead, data dependence, and coarse-grained optimization. We present a reinforcement fine-tuning framework comprising a heterogeneous-representation, multi-dimensional reward model, MotionReward, and an efficient, fine-grained fine-tuning method, EasyTune. To obtain a unified semantic representation, MotionReward maps heterogeneous motions into a shared semantic space anchored by text, enabling multi-dimensional reward learning; Self-refinement Preference Learning further strengthens semantic alignment without additional annotations. For efficient and effective fine-tuning, we identify the recursive gradient dependence across denoising steps as the key bottleneck and propose EasyTune, which optimizes step-wise rather than over the full trajectory, yielding dense, fine-grained, and memory-efficient updates. Extensive experiments validate the effectiveness of our framework: it achieves an FID of 0.132 at 22.10 GB peak memory on the MLD model, saving up to 15.22 GB over DRaFT; reduces FID by 22.9% on the joint-based ACMDM; and achieves a 12.6% R-Precision gain and a 23.3% FID improvement on the rotation-based HY Motion. Our project page with code is publicly available.
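
As a concrete reading of the step-wise optimization described above, below is a minimal sketch (not the authors' implementation) of differentiable-reward fine-tuning in which the gradient is cut at each denoising step rather than flowing through the whole trajectory. It assumes a diffusers-style scheduler whose `step()` returns `prev_sample` and `pred_original_sample`; the names `denoiser`, `reward_model`, and `easytune_style_update` are illustrative placeholders.

```python
import torch

def easytune_style_update(denoiser, reward_model, scheduler, x_T, text_emb, optimizer):
    """Illustrative step-wise reward fine-tuning (a sketch, not the paper's code).

    The incoming latent is detached at every step, one reverse step is computed
    with gradients enabled, and a differentiable reward on the one-step
    clean-sample estimate is backpropagated through that single step only,
    avoiding the recursive gradient chain across denoising steps.
    """
    x_t = x_T
    for t in scheduler.timesteps:
        x_t = x_t.detach()                       # cut the recursive gradient dependence
        eps = denoiser(x_t, t, text_emb)         # predict noise at the current step
        out = scheduler.step(eps, t, x_t)        # one reverse-diffusion step
        x0_hat = out.pred_original_sample        # one-step estimate of the clean motion
        loss = -reward_model(x0_hat, text_emb).mean()  # maximize the reward
        optimizer.zero_grad()
        loss.backward()                          # gradient flows through this step only
        optimizer.step()
        x_t = out.prev_sample                    # continue the sampling trajectory
```

Because nothing upstream of the current step stays in the autograd graph, peak memory is bounded by a single denoising step, whereas backpropagating the reward through the full (or truncated) trajectory must keep every retained step in memory.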

Paper Structure

This paper contains 19 sections, 3 theorems, 36 equations, 15 figures, 5 tables.

Key Result

Corollary 1

Given the reverse process $\mathbf{x}_{t-1}^\theta = \pi_\theta(\mathbf{x}_t^\theta, t, c)$, the gradient with respect to the diffusion-model parameters $\theta$, denoted $\tfrac{\partial \mathbf{x}^\theta_{t-1}}{\partial \theta}$, can be expressed as:
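
The equation body did not survive extraction. Under the stated recursion $\mathbf{x}_{t-1}^\theta = \pi_\theta(\mathbf{x}_t^\theta, t, c)$, the total derivative expands by the chain rule as follows; this is a reconstruction consistent with the paper's point about recursive gradient dependence, not a verbatim quote of the corollary:

$$\frac{\partial \mathbf{x}_{t-1}^{\theta}}{\partial \theta}
= \underbrace{\frac{\partial \pi_\theta(\mathbf{x}_t^{\theta}, t, c)}{\partial \theta}}_{\text{local, step-wise term}}
+ \underbrace{\frac{\partial \pi_\theta(\mathbf{x}_t^{\theta}, t, c)}{\partial \mathbf{x}_t^{\theta}}\,\frac{\partial \mathbf{x}_t^{\theta}}{\partial \theta}}_{\text{recursive dependence on all earlier steps}}.$$

The second term is what forces existing differentiable-reward methods to backpropagate through the entire denoising trajectory; step-wise optimization drops it, which corresponds to detaching $\mathbf{x}_t^\theta$ at each step in the sketch above.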

Figures (15)

  • Figure 1: Comparison of training costs and generation performance on HumanML3D [guo2022generating]. (a) Performance comparison of different fine-tuning methods [clark2024directly, prabhudesai2023aligning, wu2025drtune]. (b) Generalization performance across six pre-trained diffusion-based models [chen2023executing, motionlcm-v2, Dai2025, tevet2023human, zhang2022motiondiffuse].
  • Figure 2: Overview of MotionReward, consisting of unified projection, representation, and multiple preference learning (a minimal illustrative sketch follows this figure list).
  • Figure 3: The framework of existing differentiable reward-based methods (left) and our proposed EasyTune (right). Existing methods backpropagate the gradients of the reward model through the entire denoising process, resulting in (1) excessive memory consumption, (2) inefficient training, and (3) coarse-grained optimization. In contrast, EasyTune optimizes the diffusion model by backpropagating the gradients directly at each denoising step, overcoming these issues.
  • Figure 4: Gradient norm with respect to denoising steps. Here, $\text{dim}(\cdot)$ denotes the gradient dimension.
  • Figure 5: Similarity between the $t$-th-step noised motion and the clean motion.
  • ...and 10 more figures
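
The MotionReward design summarized in the abstract and in the Figure 2 caption (unified projection into a text-anchored semantic space, followed by multi-dimensional preference learning) can be pictured with the following minimal sketch. Every module name, dimension, and reward head below is a placeholder assumption for illustration, not the paper's architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class MotionRewardSketch(nn.Module):
    """Illustrative reward model: per-representation encoders project
    heterogeneous motions (joints or rotations) into one semantic space
    anchored by a text embedding, from which multiple reward dimensions
    are read out. Names and dimensions are placeholders."""

    def __init__(self, text_dim=512, joint_dim=263, rot_dim=135, shared_dim=512):
        super().__init__()
        # One projector per motion representation -> shared semantic space
        self.projectors = nn.ModuleDict({
            "joints": nn.Sequential(nn.Linear(joint_dim, shared_dim), nn.GELU(),
                                    nn.Linear(shared_dim, shared_dim)),
            "rotations": nn.Sequential(nn.Linear(rot_dim, shared_dim), nn.GELU(),
                                       nn.Linear(shared_dim, shared_dim)),
        })
        self.text_proj = nn.Linear(text_dim, shared_dim)  # anchor text in the same space
        self.realism_head = nn.Linear(shared_dim, 1)      # an extra, non-semantic reward dimension

    def forward(self, motion, text_emb, rep="joints"):
        m = self.projectors[rep](motion).mean(dim=1)      # (B, T, D_rep) -> pooled (B, D)
        t = self.text_proj(text_emb)                      # (B, D)
        semantic = F.cosine_similarity(m, t, dim=-1)      # text-motion alignment reward
        realism = self.realism_head(m).squeeze(-1)        # auxiliary reward dimension
        return semantic + realism                         # one possible multi-dimensional combination
```

Because both representations land in the same text-anchored space, a single reward model can score joint-based and rotation-based generators alike, which is what allows one fine-tuning pipeline to cover models such as ACMDM and HY Motion.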

Theorems & Definitions (5)

  • Corollary 1
  • Corollary 2
  • Corollary
  • Proof
  • Proof