T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data

Mingdian Liu, Yilin Liu, Gurunandan Krishnan, Karl S Bayer, Bing Zhou

TL;DR

T2M-X addresses expressive text-to-motion generation for full-body humanoid animation by learning from partially annotated data. It combines three part-specific VQ-VAE experts (body, hand, face) with a coordinating multi-indexing GPT and a joint-space consistency loss to produce coherent multi-part motions conditioned on text. A jitter-mitigation pipeline and a curated dataset built from Mixamo, GRAB, and IDEA400 further improve quality and generalization. The authors report better quantitative metrics and qualitative coherence than body-only baselines and ablations, with more expressive hand and facial motion and stronger coordination across body parts. By generating whole-body motion from natural language prompts, T2M-X extends text-to-motion systems toward realistic, expressive avatars in AR/VR and animation pipelines.

Abstract

The generation of humanoid animation from text prompts can profoundly impact animation production and AR/VR experiences. However, existing methods generate only body motion data, excluding facial expressions and hand movements. This limitation, primarily due to the lack of a comprehensive whole-body motion dataset, inhibits their readiness for production use. Recent attempts to create such a dataset have resulted in either motion inconsistency among different body parts in the artificially augmented data or lower quality in the data extracted from RGB videos. In this work, we propose T2M-X, a two-stage method that learns expressive text-to-motion generation from partially annotated data. T2M-X trains three separate Vector Quantized Variational AutoEncoders (VQ-VAEs) for body, hand, and face on respective high-quality data sources to ensure high-quality motion outputs, and a Multi-indexing Generative Pretrained Transformer (GPT) model with motion consistency loss for motion generation and coordination among different body parts. Our results show significant improvements over the baselines both quantitatively and qualitatively, demonstrating the method's robustness to dataset limitations.
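
The first stage described above relies on vector quantization: each part-specific expert maps a continuous encoder latent to the index of its nearest codebook entry, and those indices are the motion tokens a GPT-style model can then predict. The following is a minimal sketch of that generic lookup step only, not the authors' implementation. The function names (`quantize`, `tokenize_parts`), the 2-D toy codebooks, and the latent values are all hypothetical; in a real VQ-VAE the codebooks are learned and the latents come from an encoder network.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def quantize(latent: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry nearest to `latent` (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(latent, codebook[i]))


def tokenize_parts(
    latents: Dict[str, List[Vector]],
    codebooks: Dict[str, List[Vector]],
) -> Dict[str, List[int]]:
    """Quantize each part's latent sequence with that part's own codebook."""
    return {
        part: [quantize(z, codebooks[part]) for z in seq]
        for part, seq in latents.items()
    }


# Toy 2-D codebooks, one per body part (hypothetical values; real ones are learned).
codebooks = {
    "body": [(0.0, 0.0), (1.0, 1.0)],
    "hand": [(0.0, 1.0), (1.0, 0.0)],
    "face": [(0.5, 0.5), (-1.0, -1.0)],
}

# Two time steps of stand-in encoder latents per part (also hypothetical).
latents = {
    "body": [(0.1, -0.1), (0.9, 1.2)],
    "hand": [(0.2, 0.8), (0.9, 0.1)],
    "face": [(0.4, 0.6), (-0.8, -1.1)],
}

tokens = tokenize_parts(latents, codebooks)
print(tokens)
```

Keeping one codebook per part is what allows each expert to be trained on a different data source; the sketch leaves out how the three token streams are coordinated, which the abstract assigns to the multi-indexing GPT and the motion consistency loss.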

Paper Structure (13 sections, 7 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Difference between text to motion generation and text to expressive whole-body motion generation.
  • Figure 2: Limitations of existing whole-body motion datasets.
  • Figure 3: The architecture of VQ-VAE experts for body, hand, and face motion token generation, multi-indexing GPT for coordination, and the joint space for consistency learning.
  • Figure 4: Qualitative results of the generated motions from our T2M-X models with and without consistency loss, compared with the T2M-GPT baseline.