Enhancing Motion Variation in Text-to-Motion Models via Pose and Video Conditioned Editing

Clayton Leite, Yu Xiao

TL;DR

Data scarcity limits the motion diversity of text-to-motion models. This work addresses that limitation with pose- and video-conditioned editing, which supports both global and local motion edits. The method uses a two-stage training paradigm (embedding-space training followed by diffusion-model fine-tuning) and an inference-time linear blend that integrates base motions with condition-driven edits, enabling unseen motions such as football kicks. A user study with 26 participants shows that the edited motions achieve realism comparable to motions well represented in the training data while remaining aligned with the pose or video conditions. The approach expands motion repertoires data-efficiently and can be integrated with existing diffusion-based text-to-motion systems for applications in animation and interactive avatars.
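
The inference-time linear blend mentioned above can be illustrated with a minimal sketch. This summary does not say which quantities are blended or how the blend weight is chosen, so everything here is an assumption made for illustration: the function name `linear_blend`, the scalar weight `alpha`, and the representation of the base and the condition-driven edit as equal-length sequences of per-frame feature vectors are all hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def linear_blend(
    base: Sequence[Sequence[float]],
    edit: Sequence[Sequence[float]],
    alpha: float,
) -> List[List[float]]:
    """Blend two same-shaped sequences frame by frame.

    Hypothetical helper: `base` stands in for the base motion and `edit`
    for the condition-driven edit. alpha = 0 returns the base unchanged,
    alpha = 1 returns the edit unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(base) != len(edit):
        raise ValueError("base and edit must have the same number of frames")

    blended: List[List[float]] = []
    for base_frame, edit_frame in zip(base, edit):
        if len(base_frame) != len(edit_frame):
            raise ValueError("frames must have the same number of features")
        # Convex combination of corresponding features in the two frames.
        blended.append(
            [(1.0 - alpha) * b + alpha * e for b, e in zip(base_frame, edit_frame)]
        )
    return blended
```

In this sketch, a larger `alpha` moves the result toward the edit, and a smaller one preserves more of the base motion. Whether the paper blends in embedding space, in the diffusion model's latent space, or on raw poses is not specified in this summary.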

Abstract

Text-to-motion models that generate sequences of human poses from textual descriptions are garnering significant attention. However, due to data scarcity, the range of motions these models can produce is still limited. For instance, current text-to-motion models cannot generate a motion of kicking a football with the instep of the foot, since the training data only includes martial arts kicks. We propose a novel method that uses short video clips or images as conditions to modify existing basic motions. In this approach, the model's understanding of a kick serves as the prior, while the video or image of a football kick acts as the posterior, enabling the generation of the desired motion. By incorporating these additional modalities as conditions, our method can create motions not present in the training set, overcoming the limitations of text-motion datasets. A user study with 26 participants demonstrated that our approach produces unseen motions with realism comparable to commonly represented motions in text-motion datasets (e.g., HumanML3D), such as walking, running, squatting, and kicking.

Paper Structure

This paper contains 24 sections, 14 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Our method generates new motions from a base motion and an inserted pose or short video clip. For instance, given a base motion of walking lightly (depicted on the left), the user can adjust the body joint angles via a GUI [bodymodelgithub] to insert a pose of barely holding a heavy load. The method then transforms the motion to depict short, burdened steps. Alternatively, a short video clip (2-4 seconds) can be used. Our method can modify a common kick into a football kick using information from a low-quality video of a person kicking a football, even though the training dataset (HumanML3D [humanml3d]) contains no kicks of this type. The final motion mimics kicking a football with the instep of the foot, as in the video input.
  • Figure 2: Overview of our method. Green blocks represent data, red blocks process the data, and gray blocks are neural network models or parameters. Ice and fire icons indicate frozen and trainable parameters, respectively.
  • Figure 3: Visualization of some of the motions generated by our method.