MotionGlot: A Multi-Embodied Motion Generation Model

Sudarshan Harithas, Srinath Sridhar

TL;DR

MotionGlot tackles the challenge of generating motion across multiple embodiments with different action spaces by transplanting instruction-tuning concepts from multilingual LLMs into motion generation. It introduces a unified tokenization and vocabulary framework, built from per-embodiment VQ-VAEs and a cross-embodiment instruction template, that lets a single decoder generate text and motion for humans and quadruped robots. Two dedicated datasets, QUAD-LOCO and QUES-CAP, address data scarcity and situational prompting, and hardware experiments validate the system in real-world use. Across six tasks, MotionGlot generalizes across embodiments, captures multi-modal motion distributions, and performs competitively with or better than specialized baselines, demonstrating practical potential for versatile motion generation and captioning.

Abstract

This paper introduces MotionGlot, a model that can generate motion across multiple embodiments with different action dimensions, such as quadruped robots and human bodies. By leveraging the well-established training procedures commonly used in large language models (LLMs), we introduce an instruction-tuning template specifically designed for motion-related tasks. Our approach demonstrates that the principles underlying LLM training can be successfully adapted to learn a wide range of motion generation tasks across multiple embodiments with different action dimensions. We demonstrate the various abilities of MotionGlot on a set of 6 tasks and report an average improvement of 35.3% across tasks. Additionally, we contribute two new datasets: (1) a dataset of expert-controlled quadruped locomotion with approximately 48,000 trajectories paired with direction-based text annotations, and (2) a dataset of over 23,000 situational text prompts for human motion generation tasks. Finally, we conduct hardware experiments to validate the capabilities of our system in real-world applications.

Paper Structure

This paper contains 19 sections, 3 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Overview. MotionGlot generates motion trajectories that follow user instructions across multiple embodiments with different action dimensions, such as (a) quadruped robots and (b) humans. Panels (a) and (b) qualitatively compare MotionGlot against the adapted templates (A.T.) of RT-2 on the text-to-robot-motion (\ref{sec:exp_t2rm}) and Q&A-with-human-motion (\ref{exp:q_and_a}) tasks, respectively. Panel (c) shows the overall quantitative performance across tasks. In (a) and (b), increasing opacity indicates forward time.
  • Figure 2: (a) Trajectories from different embodiments are tokenized using their associated VQ-VAEs (\ref{sec:tokenization}). (b) The proposed instruction template (\ref{sec:motionglot_template}) is used to train GPT for motion and text generation. Note that the tokenizer and de-tokenizer operate on the expanded vocabulary $\mathcal{V}$ (\ref{sec:vocab_exp}). (c) A preview of the QUAD-LOCO dataset; the captions show the direction-based text annotations.
  • Figure 3: Qualitative results on the goal-reaching task. Our method captures the multi-modal nature of the trajectory distribution; Diffuser generates paths toward the goal, but it converges at the goal less often.
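
The expanded vocabulary mentioned in the Figure 2 caption can be pictured with a small sketch. This is not the authors' implementation: the class name `ExpandedVocab`, the block layout (text ids first, then one contiguous block of motion ids per embodiment), and all sizes are assumptions chosen for illustration. It shows one way that codebook indices from separate per-embodiment VQ-VAEs could share a single token id space without colliding.

```python
from typing import Dict, List, Tuple


class ExpandedVocab:
    """Hypothetical shared vocabulary: text ids first, then one contiguous
    block of motion-token ids per embodiment (this layout is an assumption)."""

    def __init__(self, text_size: int, codebook_sizes: Dict[str, int]) -> None:
        self.text_size = text_size
        self.codebook_sizes = dict(codebook_sizes)
        # Give each embodiment the next free block after the text tokens.
        self.offsets: Dict[str, int] = {}
        next_id = text_size
        for name, size in self.codebook_sizes.items():
            self.offsets[name] = next_id
            next_id += size
        self.size = next_id  # total size of the expanded vocabulary

    def encode(self, embodiment: str, codes: List[int]) -> List[int]:
        """Map codebook indices of one embodiment to shared token ids."""
        offset = self.offsets[embodiment]
        limit = self.codebook_sizes[embodiment]
        for code in codes:
            if not 0 <= code < limit:
                raise ValueError(f"code {code} outside codebook of {embodiment}")
        return [offset + code for code in codes]

    def decode(self, token_ids: List[int]) -> Tuple[str, List[int]]:
        """Recover (embodiment, codebook indices) from shared token ids.

        All ids must fall inside a single embodiment's block.
        """
        if not token_ids:
            raise ValueError("no token ids given")
        for name, offset in self.offsets.items():
            limit = offset + self.codebook_sizes[name]
            if all(offset <= t < limit for t in token_ids):
                return name, [t - offset for t in token_ids]
        raise ValueError("token ids do not belong to a single motion block")


# Illustrative sizes only; the real vocabulary and codebook sizes are not stated here.
vocab = ExpandedVocab(text_size=1000, codebook_sizes={"human": 8, "quadruped": 4})
quad_ids = vocab.encode("quadruped", [0, 3])
print(vocab.size, quad_ids, vocab.decode(quad_ids))
```

With these toy sizes the human block occupies ids 1000 to 1007 and the quadruped block ids 1008 to 1011, so a single decoder over the shared id space can emit text or motion tokens, and the de-tokenizer can route each motion id back to the matching embodiment's codebook.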