MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities

Bizhu Wu, Jinheng Xie, Keming Shen, Zhe Kong, Jianfeng Ren, Ruibin Bai, Rong Qu, Linlin Shen

TL;DR

MG-MotionLLM introduces a unified multi-granular motion-language framework by pairing a Motion VQ-VAE with a T5-based motion-aware LLM. A two-stage training regime, Granularity-Synergy Pre-training and Task-Specific Instruction Tuning, enables cross-granularity learning across 28 motion-text tasks, fostering mutual enhancement between coarse and fine-grained representations. Empirical results on HumanML3D and FineMotion demonstrate state-of-the-art performance for text-to-motion and motion-to-text, as well as strong capabilities in fine-grained motion scripting and editing within a single model. This work advances practical, fine-grained control of human motions for AR/VR, gaming, and animation by unifying generation, understanding, and editing under one framework.
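
To make the pairing of a Motion VQ-VAE with a language model more concrete, here is a minimal sketch of the general idea: continuous motion latents are snapped to their nearest codebook entries, and the resulting indices are written as discrete tokens that can sit alongside text in a prompt. This is an illustration under stated assumptions, not the authors' implementation. The function names, the toy 2-D codebook, the `<motion_id_k>` token format, and the prompt wording are all hypothetical.

```python
from typing import List, Sequence


def nearest_code(frame: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `frame` (squared L2 distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))


def motion_to_tokens(latents: Sequence[Sequence[float]],
                     codebook: Sequence[Sequence[float]]) -> List[str]:
    """Map each latent frame to a discrete token string (hypothetical token format)."""
    return [f"<motion_id_{nearest_code(frame, codebook)}>" for frame in latents]


# Toy codebook with three 2-D entries (assumption: real codebooks are far larger).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Toy encoder outputs, one latent vector per downsampled motion frame.
latents = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

tokens = motion_to_tokens(latents, codebook)

# Motion tokens and text share one sequence, so a text-to-text model can consume both.
prompt = "Describe the motion: " + "".join(tokens)
print(prompt)
```

In a setup like this, the language model's vocabulary would be extended with one entry per codebook index, so the same sequence-to-sequence model can read or emit motion tokens depending on the instruction.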

Abstract

Recent motion-aware large language models have demonstrated promising potential in unifying motion comprehension and generation. However, existing approaches primarily focus on coarse-grained motion-text modeling, where text describes the overall semantics of an entire motion sequence in just a few words. This limits their ability to handle fine-grained motion-relevant tasks, such as understanding and controlling the movements of specific body parts. To overcome this limitation, we pioneer MG-MotionLLM, a unified motion-language model for multi-granular motion comprehension and generation. We further introduce a comprehensive multi-granularity training scheme that incorporates a set of novel auxiliary tasks, such as localizing the temporal boundaries of motion segments from detailed text and detailed motion captioning, to facilitate mutual reinforcement for motion-text modeling across various levels of granularity. Extensive experiments show that our MG-MotionLLM achieves superior performance on classical text-to-motion and motion-to-text tasks, and exhibits potential in novel fine-grained motion comprehension and editing tasks. Project page: CVI-SZU/MG-MotionLLM

Paper Structure

This paper contains 23 sections, 3 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: MG-MotionLLM can address diverse motion-relevant tasks at multiple granularities by giving different instructions in a unified manner. We show results for some existing coarse-grained tasks, such as text-to-motion and motion captioning (upper block), and newly developed fine-grained tasks, including motion-to-detailed text and motion localization (bottom block). The temporal progression of motion is illustrated from left to right. Green boxes denote the input, and blue boxes are the output.
  • Figure 2: Overview of our MG-MotionLLM. It consists of a motion VQ-VAE and a T5-based motion-aware language model.
  • Figure 3: Text-Driven Fine-grained Motion Editing Examples. We display some examples of temporal editing (left), spatial editing (middle), and spatial-temporal editing (right).
  • Figure 4: Other Novel Applications. We display some examples of brand-new tasks: fine-grained captioning of both whole (top) and partial (bottom) motion sequences, and motion localization via fine-grained textual description (middle).
  • Figure 5: Examples of motion scripts for motion sequences in the FineMotion dataset.
  • ...and 3 more figures