Enhancing Motion Variation in Text-to-Motion Models via Pose and Video Conditioned Editing

Clayton Leite; Yu Xiao

TL;DR

Text-to-motion models produce a limited range of motions because paired text-motion data is scarce. This work addresses that limitation with pose- and video-conditioned editing, which supports both global and local motion edits. The method uses a two-stage training paradigm (embedding-space training followed by diffusion-model fine-tuning) and an inference-time linear blend that combines base motions with condition-driven edits, enabling motions absent from the training data, such as football kicks. In a user study with 26 participants, the edited motions were rated as comparably realistic to standard motions represented in the training data, while remaining aligned with the visual or pose conditions. The approach expands a model's motion repertoire without requiring additional text-motion data and can be integrated with existing diffusion-based text-to-motion systems, with applications in animation and interactive avatars.
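The inference-time linear blend mentioned above can be illustrated with a minimal sketch. The summary does not give the exact formulation, so everything here is an assumption made for illustration: the function name `blend_motions`, the flattened per-frame pose representation, the scalar weight `alpha`, and the optional `joint_weights` mask used to mimic a local (per-feature) edit as opposed to a global one.

```python
from typing import List, Optional, Sequence

Frame = List[float]  # one pose: flattened joint features for a single frame (assumed layout)


def blend_motions(
    base: Sequence[Frame],
    edited: Sequence[Frame],
    alpha: float,
    joint_weights: Optional[Sequence[float]] = None,
) -> List[Frame]:
    """Linearly blend a base motion with a condition-driven edit.

    Each output value is (1 - w) * base + w * edited, where
    w = alpha * joint_weights[j]. With joint_weights=None every feature
    uses w = alpha (a global edit); a 0/1 mask restricts the edit to
    selected features (a local edit). Hypothetical helper, not the
    paper's implementation.
    """
    if len(base) != len(edited):
        raise ValueError("motions must have the same number of frames")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    blended: List[Frame] = []
    for b_frame, e_frame in zip(base, edited):
        if len(b_frame) != len(e_frame):
            raise ValueError("frames must have the same number of features")
        weights = joint_weights if joint_weights is not None else [1.0] * len(b_frame)
        if len(weights) != len(b_frame):
            raise ValueError("joint_weights must match the feature count")
        blended.append(
            [(1.0 - alpha * w) * b + alpha * w * e
             for b, e, w in zip(b_frame, e_frame, weights)]
        )
    return blended


# Toy data: two frames, two features per frame.
base_motion = [[0.0, 0.0], [1.0, 1.0]]
edited_motion = [[2.0, 4.0], [3.0, 5.0]]

# Global edit: every feature moves halfway toward the edited motion.
global_blend = blend_motions(base_motion, edited_motion, alpha=0.5)

# Local edit: only the second feature takes the edited value.
local_blend = blend_motions(
    base_motion, edited_motion, alpha=1.0, joint_weights=[0.0, 1.0]
)
```

In this sketch, `alpha = 0` returns the base motion unchanged and `alpha = 1` returns the edited motion wherever the mask is 1, so the weight acts as a dial between the model's prior motion and the condition-driven edit.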

Abstract

Text-to-motion models that generate sequences of human poses from textual descriptions are garnering significant attention. However, due to data scarcity, the range of motions these models can produce is still limited. For instance, current text-to-motion models cannot generate a motion of kicking a football with the instep of the foot, since the training data only includes martial arts kicks. We propose a novel method that uses short video clips or images as conditions to modify existing basic motions. In this approach, the model's understanding of a kick serves as the prior, while the video or image of a football kick acts as the posterior, enabling the generation of the desired motion. By incorporating these additional modalities as conditions, our method can create motions not present in the training set, overcoming the limitations of text-motion datasets. A user study with 26 participants demonstrated that our approach produces unseen motions with realism comparable to commonly represented motions in text-motion datasets (e.g., HumanML3D), such as walking, running, squatting, and kicking.
