No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts

Girolamo Macaluso, Lorenzo Mandelli, Mirko Bicchierai, Stefano Berretti, Andrew D. Bagdanov

TL;DR

This work introduces a reinforcement learning-based post-training framework that fine-tunes pretrained motion diffusion models using only textual prompts, guided by a pretrained text-motion retrieval reward and without any ground-truth motion data. The method uses DDPO with importance sampling and LoRA adapters, coupled with a fast DPM-Solver++ sampler, to achieve data- and compute-efficient adaptation. Across cross-dataset and leave-one-out scenarios on HumanML3D and KIT-ML, the approach yields consistent improvements in semantic alignment and FID while preserving performance on the original distribution and offering privacy-preserving advantages. The results demonstrate the practicality of RL-based post-training for scalable, domain-aware motion synthesis without costly ground-truth data or retraining from scratch.

Abstract

Diffusion models have recently advanced human motion generation, producing realistic and diverse animations from textual prompts. However, adapting these models to unseen actions or styles typically requires additional motion capture data and full retraining, which is costly and difficult to scale. We propose a post-training framework based on Reinforcement Learning that fine-tunes pretrained motion diffusion models using only textual prompts, without requiring any motion ground truth. Our approach employs a pretrained text-motion retrieval network as a reward signal and optimizes the diffusion policy with Denoising Diffusion Policy Optimization, effectively shifting the model's generative distribution toward the target domain without relying on paired motion data. We evaluate our method on cross-dataset adaptation and leave-one-out motion experiments using the HumanML3D and KIT-ML datasets across both latent- and joint-space diffusion architectures. Results from quantitative metrics and user studies show that our approach consistently improves the quality and diversity of generated motions, while preserving performance on the original distribution. Our approach is a flexible, data-efficient, and privacy-preserving solution for motion adaptation.

Paper Structure

This paper contains 13 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of our fine-tuning procedure. Left: Sample Collection. Diffusion trajectories are generated from Gaussian noise conditioned on prompts sampled from the dataset. At each denoising step, the model outputs a normal distribution from which $\mathbf{x}_{t-1}$ is sampled; the sample and its likelihood $p_{\theta}(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c})$, along with the timestep, input, and prompt, are stored in the replay buffer. After denoising, the final animation is evaluated by the reward model, which embeds both the prompt and the animation into a joint space and assigns a reward based on their embedding distance. Right: Policy Update. Trajectories are sampled from the replay buffer, likelihoods are recomputed with the current diffusion model (DM), and the model is updated using the DDPO loss.
  • Figure 2: Example of improved text adherence after our fine-tuning of the StableMoFusion model. The figure shows the full animation, with color indicating time from blue to orange. The first row depicts the model before fine-tuning, while the second row shows the model after fine-tuning. After fine-tuning, the generated motions better follow the textual prompts. In particular, in panels (b) and (c), the model fully completes the circular motion, and in panels (a) and (b), the hand movements are more expressive.
  • Figure 3: Perception study results. Human raters compared our method against pretrained baseline models in the Human-to-Kit scenario, judging both motion realism and text adherence in A/B tests.
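
The two stages described in the Figure 1 caption can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the cosine-similarity form of the reward, the per-batch reward normalization, and the clip range of 0.2 are all assumptions chosen for illustration. The caption states only that the reward depends on the embedding distance and that stored likelihoods are recomputed with the current model before a DDPO update.

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log-density of x under an isotropic Gaussian N(mean, std^2 I).

    Stands in for log p_theta(x_{t-1} | x_t, c) at one denoising step.
    """
    return sum(
        -0.5 * ((xi - mi) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for xi, mi in zip(x, mean)
    )


def cosine_reward(text_emb, motion_emb):
    """Assumed reward: cosine similarity between prompt and motion embeddings."""
    dot = sum(a * b for a, b in zip(text_emb, motion_emb))
    norm_t = math.sqrt(sum(a * a for a in text_emb))
    norm_m = math.sqrt(sum(b * b for b in motion_emb))
    return dot / (norm_t * norm_m)


def normalize(rewards, eps=1e-8):
    """Assumed advantage estimate: rewards standardized within the batch."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def ddpo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_range=0.2):
    """Clipped importance-sampled policy-gradient loss, averaged over steps.

    old_log_probs are the likelihoods stored in the replay buffer at sampling
    time; new_log_probs are recomputed with the current model parameters.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # importance weight
        clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        total += -min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

In a real training loop the log-probabilities would come from the diffusion model's per-step Gaussian and the loss would be differentiated with respect to the trainable (for example LoRA) parameters; the plain-float version here only shows how the stored and recomputed likelihoods combine into the importance ratio.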