CoMo: Controllable Motion Generation through Language Guided Pose Code Editing

Yiming Huang, Weilin Wan, Yue Yang, Chris Callison-Burch, Mark Yatskar, Lingjie Liu

TL;DR

CoMo is a Controllable Motion generation model that accurately generates and edits motions by leveraging the knowledge priors of large language models (LLMs). In human studies, it substantially surpasses previous work in motion editing.

Abstract

Text-to-motion models excel at efficient human motion generation, but existing approaches lack fine-grained controllability over the generation process. Consequently, modifying subtle postures within a motion or inserting new actions at specific moments remains a challenge, limiting the applicability of these methods in diverse scenarios. In light of these challenges, we introduce CoMo, a Controllable Motion generation model, adept at accurately generating and editing motions by leveraging the knowledge priors of large language models (LLMs). Specifically, CoMo decomposes motions into discrete and semantically meaningful pose codes, with each code encapsulating the semantics of a body part, representing elementary information such as "left knee slightly bent". Given textual inputs, CoMo autoregressively generates sequences of pose codes, which are then decoded into 3D motions. Leveraging pose codes as interpretable representations, an LLM can directly intervene in motion editing by adjusting the pose codes according to editing instructions. Experiments demonstrate that CoMo achieves competitive performance in motion generation compared to state-of-the-art models while, in human studies, CoMo substantially surpasses previous work in motion editing abilities.
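
To make the editing idea concrete, here is a minimal sketch of how interpretable pose codes could be adjusted once an LLM has turned an instruction into structured edits. It is an illustration under stated assumptions, not the authors' implementation: the `PoseFrame` layout, the `CodeEdit` format, the state names other than "slightly bent", and `apply_edits` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

# One frame of pose codes: body part -> discrete state.
# The vocabulary here is hypothetical; only "slightly bent" echoes the abstract.
PoseFrame = Dict[str, str]


@dataclass
class CodeEdit:
    """One edit an LLM might propose (hypothetical format)."""
    part: str       # body part whose code should change
    new_state: str  # replacement state for that part
    start: int      # first affected frame, inclusive
    end: int        # last affected frame, exclusive


def apply_edits(motion: List[PoseFrame], edits: List[CodeEdit]) -> List[PoseFrame]:
    """Return a copy of the motion with the targeted codes replaced."""
    edited = [dict(frame) for frame in motion]  # leave the input untouched
    for e in edits:
        # Clamp the frame range so out-of-bounds edits are ignored safely.
        for t in range(max(e.start, 0), min(e.end, len(edited))):
            edited[t][e.part] = e.new_state
    return edited


# A four-frame toy motion and an edit meaning
# "bend the left knee slightly during frames 1 and 2".
motion = [{"left knee": "straight", "right elbow": "slightly bent"} for _ in range(4)]
edits = [CodeEdit(part="left knee", new_state="slightly bent", start=1, end=3)]
edited = apply_edits(motion, edits)
```

Because only the targeted codes change, every other body part and frame keeps its original value, which is the property that makes fine-grained and iterative editing plausible in this representation. In the full system the edited codes would then be passed to the learned decoder to produce 3D motion.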

Paper Structure

This paper contains 22 sections, 5 equations, 14 figures, and 10 tables.

Figures (14)

  • Figure 1: CoMo, a language-guided human motion synthesis model, enables controllable generation from text inputs. CoMo allows for the control of individual body part movements, facilitates fine-grained editing of each joint and frame, and supports iterative editing that preserves the essence of the original motions.
  • Figure 2: Overview of CoMo for text-driven motion generation. The Motion Encoder-Decoder (left) uses a predefined codebook to encode motions into pose codes and learns a decoder to reconstruct the motions. The Motion Generator (right), a transformer-based model, predicts pose codes autoregressively, conditioned on the text descriptions and LLM-generated fine-grained keywords. The generated pose codes are then decoded back into motions using the previously trained decoder.
  • Figure 3: Overview of CoMo for Fine-Grained Motion Editing: Given an original motion and an editing instruction, CoMo encodes the motion into pose codes, serving as the context to prompt an LLM. The LLM identifies the target codes for editing based on the instructions and updates the corresponding codes accordingly. These edited codes are then decoded back into motions to satisfy the user's requirements.
  • Figure 4: Qualitative examples of Motion Generation on the HumanML3D test set [t2m_guo]. The motion sequences progress from left to right. The red boxes identify misalignments between the generated motion sequence and the text description. CoMo achieves competitive results in motion generation compared to T2M-GPT [t2mgpt] and FineMoGen [zhang2023finemogen]. More visual results are available in the Appendix.
  • Figure 5: Human preference on Motion Editing, comparing CoMo with T2M-GPT [t2mgpt] and FineMoGen [zhang2023finemogen]. We report scores on five editing types along with the average.
  • ...and 9 more figures
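
The "predefined codebook" in the Figure 2 caption can be pictured as a fixed set of rules that map continuous joint measurements to discrete categories. The sketch below shows one such rule for a knee; the bin edges, the labels, and the `knee_code` helper are assumptions chosen for illustration, not values taken from the paper.

```python
from bisect import bisect_right
from typing import List

# Hypothetical bin edges in degrees of knee flexion (0 = fully extended)
# and the human-readable label for each resulting bin.
KNEE_BIN_EDGES = [15.0, 60.0, 120.0]
KNEE_LABELS = ["straight", "slightly bent", "bent", "completely bent"]


def knee_code(flexion_deg: float) -> int:
    """Map a knee flexion angle to the index of its discrete pose code."""
    # bisect_right counts how many edges are <= the angle, i.e. the bin index.
    return bisect_right(KNEE_BIN_EDGES, flexion_deg)


def encode_knee_track(angles: List[float]) -> List[int]:
    """Encode a per-frame sequence of knee angles into pose code indices."""
    return [knee_code(a) for a in angles]


codes = encode_knee_track([10.0, 30.0, 90.0, 130.0])
labels = [KNEE_LABELS[c] for c in codes]
```

Since the mapping is fixed rather than learned, each code index keeps a stable, readable meaning, which is what lets a language model reason about the codes directly.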