MotionGlot: A Multi-Embodied Motion Generation Model
Sudarshan Harithas, Srinath Sridhar
TL;DR
MotionGlot tackles the challenge of generating motion across multiple embodiments with different action spaces by transplanting instruction-tuning concepts from multilingual LLMs into motion generation. It introduces a unified tokenization and vocabulary framework, built from per-embodiment VQ-VAEs and a cross-embodiment instruction template, that enables a single decoder to generate text and motion for humans and quadruped robots. Two dedicated datasets, QUAD-LOCO and QUES-CAP, address data scarcity and situational prompting, while hardware validation confirms real-world applicability. Across six tasks, MotionGlot achieves strong cross-embodiment generalization, captures multi-modal motion distributions, and performs competitively with or better than specialized baselines, demonstrating practical potential for versatile motion generation and captioning.
Abstract
This paper introduces MotionGlot, a model that can generate motion across multiple embodiments with different action dimensions, such as quadruped robots and human bodies. By leveraging the well-established training procedures commonly used in large language models (LLMs), we introduce an instruction-tuning template specifically designed for motion-related tasks. Our approach demonstrates that the principles underlying LLM training can be successfully adapted to learn a wide range of motion generation tasks across multiple embodiments with different action dimensions. We demonstrate the various abilities of MotionGlot on a set of 6 tasks and report an average improvement of 35.3% across tasks. Additionally, we contribute two new datasets: (1) a dataset of expert-controlled quadruped locomotion with approximately 48,000 trajectories paired with direction-based text annotations, and (2) a dataset of over 23,000 situational text prompts for human motion generation tasks. Finally, we conduct hardware experiments to validate the capabilities of our system in real-world applications.
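To make the unified-vocabulary idea concrete, the following is a minimal sketch of how discrete codes from per-embodiment VQ-VAE codebooks could be mapped into one token space shared with text, so that a single decoder can emit either modality. It is an illustration under stated assumptions, not the authors' implementation: the vocabulary and codebook sizes, the function names, and the prompt wording in `build_prompt` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical sizes; this section does not state the actual values.
TEXT_VOCAB_SIZE = 50_000
CODEBOOK_SIZES = {"human": 512, "quadruped": 512}


@dataclass(frozen=True)
class UnifiedVocab:
    text_size: int   # ids [0, text_size) are reserved for text tokens
    offsets: dict    # embodiment -> first id of its motion-token range
    sizes: dict      # embodiment -> number of codebook entries
    total: int       # total number of ids in the unified vocabulary


def build_unified_vocab(text_size, codebook_sizes):
    """Append one contiguous id range per embodiment after the text ids."""
    offsets = {}
    next_id = text_size
    for name in sorted(codebook_sizes):  # sorted for a deterministic layout
        offsets[name] = next_id
        next_id += codebook_sizes[name]
    return UnifiedVocab(text_size, offsets, dict(codebook_sizes), next_id)


def motion_to_unified(vocab, embodiment, codes):
    """Map per-embodiment codebook indices to unified token ids."""
    size = vocab.sizes[embodiment]
    if any(c < 0 or c >= size for c in codes):
        raise ValueError(f"code out of range for {embodiment!r}")
    return [vocab.offsets[embodiment] + c for c in codes]


def unified_to_motion(vocab, token_id):
    """Recover (embodiment, codebook index) from a unified token id."""
    for name, start in vocab.offsets.items():
        if start <= token_id < start + vocab.sizes[name]:
            return name, token_id - start
    raise ValueError("token id is not a motion token")


def build_prompt(embodiment, instruction):
    """Hypothetical instruction template; the wording is an assumption."""
    return (f"### Instruction: generate {embodiment} motion. "
            f"{instruction}\n### Response:")


vocab = build_unified_vocab(TEXT_VOCAB_SIZE, CODEBOOK_SIZES)
quadruped_ids = motion_to_unified(vocab, "quadruped", [0, 3])
```

Because every embodiment occupies its own id range, the decoder's output can be routed back to the matching VQ-VAE decoder with `unified_to_motion`, which is one simple way to keep action spaces of different dimensions separate inside a single model.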
