MoDiPO: text-to-motion alignment via AI-feedback-driven Direct Preference Optimization

Massimiliano Pappa; Luca Collorone; Giovanni Ficarra; Indro Spinelli; Fabio Galasso

TL;DR

MoDiPO addresses the challenge of producing text-conditioned human motion that is both diverse and realistic by aligning text-to-motion diffusion models with AI-synthesized preferences using Direct Preference Optimization (DPO). The method introduces Pick-a-Move, a synthetic motion-preference dataset, and shows that AI feedback can replace costly human annotation when building preference data. Across two datasets (HumanML3D and KIT-ML) and two base models (MLD and MDM), MoDiPO substantially improves Fréchet Inception Distance (FID) while preserving R-Precision and Multi-Modality, indicating more realistic motions that remain text-consistent. The paper also contributes a dataset and a methodology that can support future research in AI-feedback-driven alignment.

Abstract

Diffusion Models have revolutionized the field of human motion generation by offering exceptional generation quality and fine-grained controllability through natural language conditioning. Their inherent stochasticity, that is, the ability to generate various outputs from a single input, is key to their success. However, this diversity should not be unrestricted, as it may lead to unlikely generations. Instead, it should be confined within the boundaries of text-aligned and realistic generations. To address this issue, we propose MoDiPO (Motion Diffusion DPO), a novel methodology that leverages Direct Preference Optimization (DPO) to align text-to-motion models. We streamline the laborious and expensive process of gathering the human preferences needed in DPO by leveraging AI feedback instead. This enables us to experiment with novel DPO strategies, using both online and offline generated motion-preference pairs. To foster future research, we contribute a motion-preference dataset, which we dub Pick-a-Move. We demonstrate, both qualitatively and quantitatively, that our proposed method yields significantly more realistic motions. In particular, MoDiPO substantially improves Fréchet Inception Distance (FID) while retaining the same R-Precision and Multi-Modality performance.

Paper Structure (23 sections, 7 equations, 3 figures, 4 tables)

Introduction
Related Works
Aligning Models with Human Feedback
Aligning Models with Synthetic Feedback
Text-to-Motion
Background
Latent Human Representation
DDPMs
Latent DDPMs
Methodology
Motion Preferences Dataset
Motion Alignment Pipeline
Experimental Evaluation
Evaluation Metrics
Quantitative Evaluation
...and 8 more sections

Figures (3)

Figure 1: By fine-tuning the models on a synthetic preferential dataset, we enhance the realism of generated motions. On the left, the MoDiPO-aligned motion walks naturally while avoiding specific areas. Similarly, on the right, the jump appears more natural than the unaligned model's generation.
Figure 2: MoDiPO Schematics: Starting with the input prompt, we generate a winner-loser pair, which constitutes a sample in our preferential dataset. To do so, the reference model produces $K$ generations based on the same input prompt. These generations are then ranked by the ranker model according to their relevance to the textual input. From these rankings, we select both a set of winners and a set of losers. Finally, we sample from these sets to determine the final pair. This pair is then used to refine the unfrozen target model using DPO.
Figure 3: Qualitative results on HumanML3D. Vanilla MLD is represented by blue motions, while MLD aligned with MoDiPO is represented by yellow motions.
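
The pair-construction procedure described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `build_preference_pair`, the callables `generate` and `score`, the default value of `k`, the sizes of the winner and loser sets, and the uniform sampling are all assumptions made for the example.

```python
import random
from typing import Callable, Optional, Tuple, TypeVar

Motion = TypeVar("Motion")


def build_preference_pair(
    prompt: str,
    generate: Callable[[str], Motion],      # hypothetical stand-in for the reference model
    score: Callable[[str, Motion], float],  # hypothetical stand-in for the ranker model
    k: int = 6,                             # assumed number of generations per prompt
    n_top: int = 2,                         # assumed size of the winner set
    n_bottom: int = 2,                      # assumed size of the loser set
    rng: Optional[random.Random] = None,
) -> Tuple[Motion, Motion]:
    """Return a (winner, loser) pair for one prompt."""
    if n_top < 1 or n_bottom < 1 or n_top + n_bottom > k:
        raise ValueError("winner and loser sets must be non-empty and disjoint")
    rng = rng or random.Random()

    # 1. The reference model produces k generations for the same prompt.
    candidates = [generate(prompt) for _ in range(k)]

    # 2. The ranker orders them by relevance to the text, best first.
    ranked = sorted(candidates, key=lambda m: score(prompt, m), reverse=True)

    # 3. Take a set of winners from the top and a set of losers from the bottom.
    winners = ranked[:n_top]
    losers = ranked[-n_bottom:]

    # 4. Sample one motion from each set (uniform sampling is an assumption).
    return rng.choice(winners), rng.choice(losers)
```

Each returned pair, together with its prompt, would form one sample of the preferential dataset that is then used for the DPO fine-tuning step.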
