Exploring Text-to-Motion Generation with Human Preference

Jenny Sheng; Matthieu Lin; Andrew Zhao; Kevin Pruvost; Yu-Hui Wen; Yangguang Li; Gao Huang; Yong-Jin Liu

TL;DR

This work addresses data scarcity in text-to-motion generation by learning from human preferences instead of motion capture labels. It annotates 3,528 pairs of motions generated by MotionGPT and compares preference-based finetuning strategies, showing that Direct Preference Optimization (DPO) aligns outputs with prompts better than RLHF or the baseline, with human evaluators preferring DPO outputs. The study analyzes design choices such as regularization, IPO variants, and LoRA, and finds that most of the gains come from pairs with a strong preference signal rather than from weakly separated pairs, while adding more data yields diminishing returns. Overall, the paper demonstrates that preference-based supervision is a viable, cheaper way to improve text-to-motion systems, and it provides a practical dataset and methodology for future research.

Abstract

This paper presents an exploration of preference learning in text-to-motion generation. We find that current improvements in text-to-motion generation still rely on datasets requiring expert labelers with motion capture systems. In contrast, learning from human preference data does not require motion capture systems; a labeler with no expertise simply compares two generated motions. This is particularly efficient because evaluating the model's output is easier than gathering the motion that performs a desired task (e.g., a backflip). To pioneer the exploration of this paradigm, we annotate 3,528 preference pairs generated by MotionGPT, marking the first effort to investigate various algorithms for learning from preference data. In particular, our exploration highlights important design choices when using preference data. Additionally, our experimental results show that preference learning has the potential to greatly improve current text-to-motion generative models. Our code and dataset are publicly available at https://github.com/THU-LYJ-Lab/InstructMotion to further facilitate research in this area.

Paper Structure (11 sections, 11 equations, 5 figures, 4 tables)

Introduction
Related Works
Autoregressive Text-to-Motion Generation
Learning from Human Preferences
Preliminary
Method
Preference Learning
RL with Human Feedback
Direct Preference Optimization
Experiments
Discussion

Figures (5)

Figure 1: Text-to-Motion Generation with Human Preference. We gather preferences over generated completion (i.e., motion) pairs and use them to finetune MotionGPT. In preference learning, the likelihood of preferred completion is increased while that of dispreferred completion is decreased. We explore two types of practical algorithms for preference learning. First, RLHF trains in an online manner; it trains a reward model on the data and uses it to perform RL on MotionGPT. Second, DPO trains in an offline manner with supervised learning; it directly performs MLE on the data. The online/offline aspect is related to whether or not the policy performs exploration, i.e., training on completions outside of the preference dataset.
Figure 2: Screenshot of the Gradio interface for data labeling.
Figure 3: Humans prefer DPO outputs over outputs from MotionGPT. MotionGPT trained on motion data with DPO (in green) has a higher win rate. The win rate is computed on prompts never seen by the model.
Figure 4: Samples with preference degrees "Much better" and "Better" provide most of the performance gains. Adding in "Slightly better" and "Negligibly better/unsure" samples slightly improves alignment but decreases quality.
Figure 5: Model is robust to choices of $\beta$. Values of $\beta$ increasing from 0.05 to 0.20 generally do not impact alignment.
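
The captions for Figures 1 and 5 describe DPO as raising the likelihood of the preferred motion, lowering that of the dispreferred one, and depending on a coefficient $\beta$. The sketch below shows the standard per-pair DPO loss under stated assumptions; it is not taken from the InstructMotion code. It assumes the summed log-probabilities of each motion token sequence under the finetuned policy and under the frozen reference model have already been computed. The function name, argument names, and the default `beta=0.1` are hypothetical and chosen only for illustration.

```python
import math


def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    All inputs are sequence log-probabilities (floats). `beta` scales the
    implicit reward; 0.1 is an assumed value, not the paper's setting.
    """
    # Log-ratio of policy to reference for each completion.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected

    # Scaled margin between preferred and dispreferred completions.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
loss_at_init = dpo_pair_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the policy's log-probability of the preferred motion lowers the loss.
loss_after_update = dpo_pair_loss(-9.0, -12.0, -10.0, -12.0)
```

In this form, the loss falls when the policy assigns relatively more probability to the preferred motion than the reference does, and a larger `beta` makes the same log-ratio gap count for more.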
