VersatileMotion: A Unified Framework for Motion Synthesis and Comprehension

Zeyu Ling, Bo Han, Shiyang Li, Jikang Cheng, Hongdeng Shen, Changqing Zou

TL;DR

VersatileMotion is a unified multimodal motion LLM that extends token-based generation across motion, text, music, and speech, and handles both single- and multi-agent scenarios. It introduces FlowVQ, a flow-matching-based decoder on top of a VQ-VAE, and MotionHub, a large standardized motion corpus with nine benchmarks for large-scale pretraining. A three-stage generalist-to-specialist training pipeline yields a strong generalist model and specialist variants that achieve state-of-the-art results on seven of the nine tasks. The framework supports cross-modal translation and high-fidelity motion synthesis, and is intended as a scalable foundation for future multimodal motion understanding and generation.

Abstract

Large language models (LLMs) are inherently capable of multi-task learning: through a unified next-token prediction paradigm, they can address a wide variety of downstream tasks. Prior work in the motion domain has demonstrated some generality by adapting LLMs via a motion tokenizer coupled with an autoregressive transformer to generate and understand human motion. However, this generality remains limited in scope and yields only modest performance gains. We introduce VersatileMotion, a unified multimodal motion LLM that combines a novel motion tokenizer, which integrates VQ-VAE with flow matching, and an autoregressive transformer backbone to support at least nine distinct motion-related tasks. VersatileMotion is the first method to handle single-agent and multi-agent motions in a single framework and to enable cross-modal conversion between motion, text, music, and speech, achieving state-of-the-art performance on seven of these tasks. For training, we rely on MotionHub, a large standardized motion corpus. Each sequence in MotionHub may include one or more of the following annotations: natural-language captions, music or audio clips, speech transcripts, and multi-agent interaction data. To facilitate evaluation, we define and release benchmark splits covering nine core tasks. Extensive experiments demonstrate the superior performance, versatility, and potential of VersatileMotion as a foundation model for future motion understanding and generation.
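
To make the token-based idea in the abstract concrete, the following is a minimal sketch of how a VQ-style tokenizer could turn motion features into discrete ids that share one vocabulary with text tokens. It is an illustration under stated assumptions, not the paper's implementation: the toy codebook, the 2-D frame features, `TEXT_VOCAB_SIZE`, and the helper names `nearest_code` and `tokenize_motion` are all hypothetical, and the flow-matching decoder of FlowVQ is not covered here.

```python
from typing import List, Sequence


def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def tokenize_motion(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
    offset: int,
) -> List[int]:
    """Quantize each frame feature and shift its id by `offset`.

    The offset (hypothetical) places motion ids after the text ids so that
    both modalities can live in one next-token-prediction sequence.
    """
    return [offset + nearest_code(frame, codebook) for frame in frames]


# Toy data: a 4-entry codebook over 2-D features (assumed, for illustration only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.9]]

TEXT_VOCAB_SIZE = 100        # hypothetical size of the text vocabulary
text_tokens = [5, 17, 42]    # placeholder text token ids

motion_tokens = tokenize_motion(frames, codebook, TEXT_VOCAB_SIZE)
sequence = text_tokens + motion_tokens  # one mixed-modality token sequence
print(sequence)
```

An autoregressive transformer trained on such mixed sequences can, in principle, condition on text ids and predict motion ids (or the reverse), which is the sense in which a single next-token objective covers several tasks.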

Paper Structure

This paper contains 56 sections, 12 equations, 10 figures, and 24 tables.

Figures (10)

  • Figure 2: The schematic of FlowVQ.
  • Figure 3: Qualitative comparison of VersatileMotion and previous state-of-the-art methods on single- and multi-agent text-to-motion tasks.
  • Figure 4: The motion generation results of VersatileMotion driven jointly by audio and text.
  • Figure 5: Samples from the different tasks in our constructed MotionHub.
  • Figure 6: Single-person text-to-motion visualization samples.
  • ...and 5 more figures