AToM: Aligning Text-to-Motion Model at Event-Level with GPT-4Vision Reward

Haonan Han, Xiangzuo Wu, Huan Liao, Zunnan Xu, Zhongyuan Hu, Ronghui Li, Yachao Zhang, Xiu Li

TL;DR

AToM introduces an event-level alignment framework for text-to-motion generation by leveraging GPT-4Vision rewards. It builds MotionPrefer, a large-scale, fine-grained preference dataset, and uses a GPT-4V-based reward paradigm to annotate alignment scores, followed by IPO/RLHF-style fine-tuning with LoRA on an off-the-shelf motion generator. The approach achieves substantial improvements in event-level alignment, motion quality, and generation diversity compared to prior baselines, validated by quantitative metrics and human judgments. The work demonstrates the viability of LVLM-based feedback to scale alignment for multimodal generation tasks, reducing reliance on manual labeling and enabling finer-grained control over sequence-level descriptions.

Abstract

Recently, text-to-motion models have opened new possibilities for creating realistic human motion with greater efficiency and flexibility. However, aligning motion generation with event-level textual descriptions presents unique challenges due to the complex relationship between textual prompts and desired motion outcomes. To address this, we introduce AToM, a framework that enhances the alignment between generated motion and text prompts by leveraging rewards from GPT-4Vision. AToM comprises three main stages. First, we construct a dataset, MotionPrefer, that pairs generated motions with three types of event-level textual prompts, covering the integrity, temporal relationship, and frequency of motion. Second, we design a paradigm that uses GPT-4Vision for detailed motion annotation, including visual data formatting, task-specific instructions, and scoring rules for each sub-task. Finally, we fine-tune an existing text-to-motion model using reinforcement learning guided by this paradigm. Experimental results demonstrate that AToM significantly improves the event-level alignment quality of text-to-motion generation.
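The step from stage two (alignment scores) to stage three (preference fine-tuning) can be illustrated with a minimal sketch. It is not the authors' implementation: the score scale, the `min_gap` threshold, the `tau` value, and the function names `build_preference_pairs` and `ipo_loss` are all assumptions made for illustration. The loss follows the standard IPO formulation, which regresses the policy-versus-reference log-ratio margin toward 1/(2·tau).

```python
from itertools import combinations


def build_preference_pairs(scored_motions, min_gap=1.0):
    """Turn per-motion alignment scores for ONE prompt into (chosen, rejected) pairs.

    scored_motions: list of (motion_id, score) tuples; the score scale is assumed.
    min_gap: hypothetical threshold; pairs with a smaller score gap are skipped
             because the preference signal is too weak.
    """
    pairs = []
    for (id_a, s_a), (id_b, s_b) in combinations(scored_motions, 2):
        if abs(s_a - s_b) < min_gap:
            continue
        # The higher-scoring motion is treated as the preferred sample.
        pairs.append((id_a, id_b) if s_a > s_b else (id_b, id_a))
    return pairs


def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """Standard IPO loss for a single preference pair.

    logp_w / logp_l: policy log-probabilities of the chosen / rejected sample.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    tau: regularization strength (value assumed here).
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (margin - 1.0 / (2.0 * tau)) ** 2


if __name__ == "__main__":
    scores = [("motion_a", 3.0), ("motion_b", 1.0), ("motion_c", 3.0)]
    print(build_preference_pairs(scores))
    print(ipo_loss(-1.0, -2.0, -1.5, -1.5, tau=0.5))
```

In an actual fine-tuning run the log-probabilities would come from the motion generator with LoRA adapters (policy) and its frozen pretrained copy (reference); here they are plain floats so the sketch stays self-contained.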

Paper Structure

This paper contains 26 sections, 8 equations, 14 figures, 15 tables, 1 algorithm.

Figures (14)

  • Figure 1: Showcases of motion samples for three scenarios. The two motion samples for each scenario were generated based on the prompt above the samples. Moreover, we leverage GPT-4V to compare two motion samples according to the degree of alignment between the motion samples and the input prompt.
  • Figure 2: The framework of AToM. AToM encompasses three stages: (1) A motion generation process using task-specific prompts constructed by LLM; (2) Evaluation of alignment score for text-motion pairs using a predefined reward paradigm based on LVLM; (3) A fine-tuning mechanism based on LoRA and RL strategy that enhances the original motion generator using the dataset MotionPrefer.
  • Figure 3: Qualitative comparison of samples generated by the pretrained model MotionGPT and the fine-tuned model AToM.
  • Figure 4: Win rates of the fine-tuned AToM model against MotionGPT, as judged by humans, across the three tasks.
  • Figure 5: Performance distribution of different reinforcement learning strategies after generative model finetuning.
  • ...and 9 more figures