VersatileMotion: A Unified Framework for Motion Synthesis and Comprehension

Zeyu Ling; Bo Han; Shiyang Li; Jikang Cheng; Hongdeng Shen; Changqing Zou

TL;DR

VersatileMotion proposes a unified multimodal motion LLM that extends token-based generation to motion, text, music, and speech, accommodating both single- and multi-agent scenarios. It introduces FlowVQ, a flow-matching-based decoder atop a VQ-VAE, and MotionHub, a large standardized motion corpus with nine benchmarks, enabling robust large-scale pretraining. A three-stage generalist-to-specialist training pipeline yields a strong generalist model and specialist variants that achieve state-of-the-art results on seven of nine tasks. The framework supports cross-modal translation and high-fidelity motion synthesis, laying a scalable foundation for future multimodal motion understanding and generation.

Abstract

Large language models (LLMs) are inherently capable of multi-task learning: through a unified next-token prediction paradigm, they can address a wide variety of downstream tasks. Prior work in the motion domain has demonstrated some generality by adapting LLMs, coupling a motion tokenizer with an autoregressive transformer to generate and understand human motion. However, this generality remains limited in scope and yields only modest performance gains. We introduce VersatileMotion, a unified multimodal motion LLM that combines a novel motion tokenizer, which integrates VQ-VAE with flow matching, and an autoregressive transformer backbone to support at least nine distinct motion-related tasks. VersatileMotion is the first method to handle single-agent and multi-agent motions in a single framework and to enable cross-modal conversion between motion, text, music, and speech, achieving state-of-the-art performance on seven of these tasks. Training is supported by MotionHub, a large standardized motion corpus. Each sequence in MotionHub may include one or more of the following annotations: natural-language captions, music or audio clips, speech transcripts, and multi-agent interaction data. To facilitate evaluation, we define and release benchmark splits covering nine core tasks. Extensive experiments demonstrate the superior performance, versatility, and potential of VersatileMotion as a foundational model for future motion understanding and generation.
