ACMo: Attribute Controllable Motion Generation

Mingjie Wei, Xuemei Xie, Guangming Shi

TL;DR

ACMo tackles precise, multi-attribute control in text-to-motion by decoupling conditioning signals into independent modules: an Attribute Diffusion Model (ADM) for text-to-motion in latent space, a lightweight Motion Adapter for rapid stylized fine-tuning, and Trajectory ControlNet for spatial trajectory control, complemented by an LLM Planner that maps unseen attributes to dataset texts via zero-shot reasoning. The approach demonstrates competitive motion quality on benchmarks like HumanML3D and 100STYLE, while enabling fine-grained control through motion prompts and robust handling of unseen attributes via the MotionIT dataset. Ablation studies validate the value of decoupled text-motion learning, cross-attention in latent space, and the lightweight finetuning of motion patterns. These contributions offer practical benefits for animation, gaming, VR, and robotics by delivering controllable, multimodal motion generation with efficient adaptation to new styles and trajectories.

Abstract

Attributes such as style, fine-grained text, and trajectory are specific conditions for describing motion. However, existing methods often lack precise user control over motion attributes and generalize poorly to unseen motions. This work introduces an Attribute Controllable Motion generation architecture that addresses these challenges by decoupling the conditions and controlling them separately. First, we explore the Attribute Diffusion Model to improve text-to-motion performance by decoupling text and motion learning, since the controllable model relies heavily on the pre-trained model. Then, we introduce the Motion Adapter to quickly fine-tune previously unseen motion patterns. Its motion prompt inputs enable multimodal text-to-motion generation that captures user-specified styles. Finally, we propose an LLM Planner that bridges the gap between unseen attributes and dataset-specific texts via local knowledge, for user-friendly interaction. Our approach introduces motion prompts for stylized generation, enabling fine-grained and user-friendly attribute control while providing performance comparable to state-of-the-art methods. Project page: https://mjwei3d.github.io/ACMo/

Paper Structure

This paper contains 12 sections, 6 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: ACMo handles motions beyond dataset representation, using motion prompts for stylized multimodal generation and multi-attribute control, with the LLM Planner mapping zero-shot unseen attributes to dataset texts. Rapid fine-tuning enables the model to recognize new motion patterns. The bracketed text enhances the stability of style and trajectory. Control your motions as you wish!
  • Figure 2: The ACMo network architecture. Stage 1: the Attribute Diffusion Model is trained by decoupling text and motion in a more powerful latent space. Stage 2: the Motion Adapter fine-tunes new motion patterns while preserving the original knowledge. Stage 3: trajectory control through ControlNet. Finally, the LLM Planner module performs inference for text processing.
  • Figure 3: To retain text knowledge, we freeze the decoupled cross-attention. Three fine-tuning methods are compared, and method (c) achieves the most efficient learning of motion patterns.
  • Figure 4: Visual comparison between different methods given two distinct text descriptions from the HumanML3D test set. Bold and italic denote the verb and attribute, respectively. This visualizes subtle differences at the word level, illustrating the advantages of our approach.
  • Figure 5: Visualization of an LLM Planner example. Bold and italics mark the key verb and the transformation of the inference context, respectively. Without the LLM Planner, the model struggles to understand context and gets disturbed (e.g., 'retrieve' is a goal, not an action to generate). The LLM understands user instructions and leverages world and local knowledge to convert them into dataset text (e.g., 'through a ditch' $\rightarrow$ 'forward' and 'retrieve a lost phone' $\rightarrow$ 'carefully'), enabling effective diffusion model generation.
  • ...and 1 more figures
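
The text rewriting described in the Figure 5 caption can be illustrated with a toy sketch. This is not the authors' LLM Planner: a fixed phrase table stands in for the LLM, and the names plan_text and PHRASE_TO_DATASET_TEXT, as well as the sample sentence, are hypothetical. Only the two rewrite pairs are taken from the caption.

```python
# Toy stand-in for the LLM Planner idea: a lookup table replaces the LLM.
# The two rewrite pairs come from the Figure 5 caption; the function name,
# table name, and sample sentence are assumptions made for illustration.
PHRASE_TO_DATASET_TEXT = {
    "through a ditch": "forward",
    "retrieve a lost phone": "carefully",
}


def plan_text(instruction: str) -> str:
    """Replace known out-of-dataset phrases with dataset-style wording."""
    result = instruction
    for phrase, dataset_text in PHRASE_TO_DATASET_TEXT.items():
        result = result.replace(phrase, dataset_text)
    return result


if __name__ == "__main__":
    # A hypothetical user instruction containing an unseen attribute.
    print(plan_text("a person walks through a ditch"))
```

A real planner would presumably prompt an LLM with the instruction and some knowledge of the dataset vocabulary rather than rely on exact phrase matches; the table only shows the input and output shape of the mapping.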