Shape My Moves: Text-Driven Shape-Aware Synthesis of Human Motions
Ting-Hsuan Liao, Yi Zhou, Yu Shen, Chun-Hao Paul Huang, Saayan Mitra, Jia-Bin Huang, Uttaran Bhattacharya
TL;DR
Body shape strongly affects how realistic a motion looks, yet most text-to-motion methods ignore it. The paper introduces ShapeMove, a two-stage framework. The first stage is a Shape-Aware FSQ-VAE (SA-VAE) that encodes shape-normalized motions into discrete tokens using Finite Scalar Quantization and reconstructs shape-aware motions $\hat{X}^R$ by conditioning the decoder on $\tilde{\beta}$, which separates pose content from body shape. The second stage is a language-model predictor that maps text to both motion tokens and continuous shape parameters $\beta$, so shape-aware motions can be generated end to end from textual prompts. The approach is evaluated on HumanML3D with SMPL-based shape augmentation, where it outperforms strong baselines on text-motion alignment, physical plausibility (Penetrate, Float, Skate, Bone Length Variances), and perceptual preference across diverse avatar shapes. By combining continuous shape cues with discrete motion tokens in a language-driven pipeline, the work targets more realistic, shape-consistent avatar animation for animation, gaming, and synthetic data generation.
Abstract
We explore how body shapes influence human motion synthesis, an aspect often overlooked in existing text-to-motion generation methods due to the ease of learning a homogenized, canonical body shape. However, this homogenization can distort the natural correlations between different body shapes and their motion dynamics. Our method addresses this gap by generating body-shape-aware human motions from natural language prompts. We utilize a finite scalar quantization-based variational autoencoder (FSQ-VAE) to quantize motion into discrete tokens and then leverage continuous body shape information to de-quantize these tokens back into continuous, detailed motion. Additionally, we harness the capabilities of a pretrained language model to predict both continuous shape parameters and motion tokens, facilitating the synthesis of text-aligned motions and decoding them into shape-aware motions. We evaluate our method quantitatively and qualitatively, and also conduct a comprehensive perceptual study to demonstrate its efficacy in generating shape-aware motions.
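To make the quantization step concrete, the following is a minimal NumPy sketch of finite scalar quantization in general, not the authors' implementation. The function names, the level configuration `[5, 5, 3]`, and the restriction to odd level counts are assumptions chosen for illustration. Each latent channel is bounded with `tanh`, rounded to a small integer grid, and the per-channel codes are packed into a single token index. A trainable version would typically pass gradients through the rounding with a straight-through estimator, which is omitted here.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Round each latent channel to one of levels[i] integer values.

    Assumes every entry of `levels` is odd, so the grid is symmetric
    around zero: channel i takes values in {-(L-1)/2, ..., (L-1)/2}.
    """
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half            # squash into (-half, half)
    return np.round(bounded).astype(np.int64)

def codes_to_tokens(codes, levels):
    """Pack per-channel codes into one integer token (mixed radix)."""
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    digits = codes + half                  # shift to 0 .. L-1
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * basis).sum(axis=-1)

def tokens_to_codes(tokens, levels):
    """Invert codes_to_tokens: recover per-channel codes from tokens."""
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    digits = (np.asarray(tokens)[..., None] // basis) % levels
    return digits - half

# Hypothetical configuration: 3 channels, implicit codebook of 5*5*3 = 75 tokens.
levels = [5, 5, 3]
z = np.array([[0.0, 0.0, 0.0],
              [10.0, -10.0, 10.0]])        # two latent frames
codes = fsq_quantize(z, levels)
tokens = codes_to_tokens(codes, levels)
recovered = tokens_to_codes(tokens, levels)
```

In this sketch the token indices carry no body shape information. In the setup described above, the decoder would receive the shape parameters as an additional input when mapping the de-quantized codes back to continuous motion.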
