Shape My Moves: Text-Driven Shape-Aware Synthesis of Human Motions

Ting-Hsuan Liao, Yi Zhou, Yu Shen, Chun-Hao Paul Huang, Saayan Mitra, Jia-Bin Huang, Uttaran Bhattacharya

TL;DR

Body shape significantly affects how realistic a motion looks, yet most text-to-motion methods ignore it. The paper introduces ShapeMove, a two-stage framework that combines a Shape-Aware FSQ-VAE (SA-VAE) for shape-conditioned motion tokenization with a language-model predictor that maps text to both motion tokens and continuous shape parameters $\beta$, enabling end-to-end generation of shape-aware motions from textual prompts. SA-VAE encodes shape-normalized motions into discrete tokens using Finite Scalar Quantization (FSQ) and reconstructs shape-aware motions $\hat{X}^R$ by conditioning the decoder on the projected shape feature $\tilde{\beta}$, which separates pose content from body shape. The approach is evaluated on HumanML3D with SMPL-based shape augmentation. Compared with strong baselines, it reports better text-motion alignment, better physical plausibility (Penetrate, Float, Skate, Bone Length Variances), and higher perceptual preference across diverse avatar shapes. By integrating continuous shape cues with discrete motion tokens in a language-driven synthesis pipeline, the work supports more realistic, shape-consistent avatar animation, with practical relevance for animation, gaming, and synthetic data generation.

Abstract

We explore how body shapes influence human motion synthesis, an aspect often overlooked in existing text-to-motion generation methods due to the ease of learning a homogenized, canonical body shape. However, this homogenization can distort the natural correlations between different body shapes and their motion dynamics. Our method addresses this gap by generating body-shape-aware human motions from natural language prompts. We utilize a finite scalar quantization-based variational autoencoder (FSQ-VAE) to quantize motion into discrete tokens and then leverage continuous body shape information to de-quantize these tokens back into continuous, detailed motion. Additionally, we harness the capabilities of a pretrained language model to predict both continuous shape parameters and motion tokens, facilitating the synthesis of text-aligned motions and decoding them into shape-aware motions. We evaluate our method quantitatively and qualitatively, and also conduct a comprehensive perceptual study to demonstrate its efficacy in generating shape-aware motions.
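
The quantization step mentioned in the abstract can be illustrated with a minimal NumPy sketch of finite scalar quantization. This is a simplified illustration under stated assumptions, not the authors' implementation: the level configuration `LEVELS`, the restriction to odd level counts, and the function names are all hypothetical. Each latent channel is bounded with `tanh`, rounded to a small set of integers, and the per-channel codes are flattened into a single token id.

```python
import numpy as np

# Hypothetical configuration: 3 latent channels with 5 levels each (125 tokens).
# Odd level counts keep the rounding symmetric around zero in this sketch.
LEVELS = np.array([5, 5, 5])

def fsq_quantize(z, levels=LEVELS):
    """Bound each channel with tanh, then round to one of levels[i] integers."""
    half = (levels - 1) / 2.0
    return np.round(np.tanh(z) * half).astype(np.int64)

def _basis(levels):
    """Mixed-radix place values, e.g. [1, 5, 25] for levels [5, 5, 5]."""
    return np.concatenate(([1], np.cumprod(levels[:-1])))

def codes_to_index(codes, levels=LEVELS):
    """Flatten per-channel codes into a single discrete token id."""
    shifted = codes + (levels - 1) // 2  # move codes into [0, levels[i] - 1]
    return (shifted * _basis(levels)).sum(axis=-1)

def index_to_codes(index, levels=LEVELS):
    """Invert codes_to_index: recover per-channel codes from token ids."""
    shifted = (np.asarray(index)[..., None] // _basis(levels)) % levels
    return shifted - (levels - 1) // 2
```

Because the codebook is an implicit grid rather than a learned table, a token id can be turned back into a feature vector with `index_to_codes` alone, which is what makes the later de-quantization step cheap.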

Paper Structure

This paper contains 23 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Text-Driven Shape-Aware Motion Synthesis. The same motion performed by different body shapes can vary significantly, a realism aspect often overlooked in text-to-motion tasks. We propose a novel framework to integrate both shape and motion descriptions as input. Our framework synthesizes the shape parameters to reflect the described physical attributes, and injects them into motion synthesis to generate plausible shape-aware motions. The figure demonstrates the same running motion synthesized across different body shapes.
  • Figure 2: Shape-Aware FSQ-VAE (SA-VAE) Overview. SA-VAE is our quantization network, which learns to generate discrete motion tokens. Given a shape-normalized motion $X^N \in \mathbb{R}^{T \times D}$ of length $T$ and dimensionality $D$ ($= 263$ in our setup), we first encode the motion with the Motion Encoder $\mathcal{E}$ into a motion feature $Z \in \mathbb{R}^{\tau \times D}$, where $\tau$ is the temporally downsampled length of $T$. We use the FSQ quantizer to quantize $Z$, which gives a discrete feature $\hat{Z}$. The $\text{MLP}_{\theta_{m}}$ and $\text{MLP}_{\phi_{m}}$ transform the features into the required code dimensions. To condition on the shape, we project the shape parameter $\beta$ with the Projector $P_{\theta_{s}}$ to align with $\hat{Z}$. We concatenate the resulting shape feature $\tilde{\beta}$ with $\hat{Z}$, then feed the result into the Motion Decoder $\mathcal{D}$ to predict the reconstructed motion $\hat{X}^R$.
  • Figure 3: ShapeMove Overview. In the training phase (a), the transformer network takes in the text inputs describing human motions and body shapes and predicts quantized motion tokens and the shape token [BETA]. The embedding for [BETA] passes through the Projector $P_{\theta_{e}}$ to predict the shape parameter $\hat{\beta}$. To optimize the model, we use a cross-entropy loss between the ground-truth tokens $C$ and the predicted tokens $\hat{C}$, and an $L_1$ loss on the shape parameter. In the inference phase (b), our model predicts motion tokens $\hat{C}$ and the shape parameter $\hat{\beta}$ from text inputs. We de-quantize the tokens using FSQ and project the predicted shape parameter with Projector $P_{\theta_{s}}$. We concatenate $\hat{\beta}$ and $\hat{C}$, and decode them into the generated motion sequence with the Motion Decoder $\mathcal{D}$. Our model effectively synthesizes shape parameters and shape-aware motions reflecting the physical form and actions described in the input text.
  • Figure 4: Qualitative Comparisons. We compare our method with three baseline methods, T2M-GPT, MotionGPT, and MotionDiffuse, illustrating two samples from the HumanML3D test set. The motions are colored from light to dark blue to represent progression over time. We highlight issues such as incorrect foot motion and other inaccuracies that do not align with expected motion patterns. Our method not only generates motions that align with the textual descriptions, but also accurately follows the body attributes and physical dynamics of the ground truth. Additional visual results and detailed comparisons are available on the project website.
  • Figure 5: Perceptual Evaluation. We show the distributions of aggregate responses from annotators on their preferences for samples generated by our method and baseline methods, including MotionDiffuse, MotionGPT, and T2M-GPT, as well as the corresponding ground truth samples. We assess the distributions on three metrics: (a) Shape to Text, how well the body shape matches the text input; (b) Motion to Text, how well the motion matches the text input; and (c) Plausibility of Motion with Shape, how realistic the motions appear for the corresponding body shapes. Across all three metrics, we observe that our method is preferred nearly as much as the ground truth and is favored by approximately 12% to 38% over the baselines.
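
The shape conditioning in Figure 2 and the training objective in Figure 3 can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the paper's code. The captions do not specify how the shape feature is concatenated, so tiling it across every time step is an assumption. The single linear map `W_s` standing in for the Projector, the `shape_weight` balancing term, and all function names are also hypothetical.

```python
import numpy as np

def condition_on_shape(z_hat, beta, W_s):
    """Concatenate a projected shape feature to every de-quantized motion token.

    z_hat: (tau, d) de-quantized motion features.
    beta:  (b,) shape parameters.
    W_s:   (b, d_s) linear stand-in for a learned projector (assumption).
    """
    beta_tilde = beta @ W_s  # (d_s,) shape feature
    tiled = np.broadcast_to(beta_tilde, (z_hat.shape[0], beta_tilde.shape[0]))
    return np.concatenate([z_hat, tiled], axis=-1)  # (tau, d + d_s)

def token_cross_entropy(logits, targets):
    """Mean cross-entropy; logits: (n, vocab), targets: (n,) integer token ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def training_loss(logits, targets, beta_hat, beta, shape_weight=1.0):
    """Cross-entropy on motion tokens plus an L1 penalty on shape parameters."""
    l1 = np.abs(beta_hat - beta).mean()
    return token_cross_entropy(logits, targets) + shape_weight * l1
```

In this sketch the decoder input grows from `d` to `d + d_s` channels, so a decoder consuming the output of `condition_on_shape` would need its first layer sized accordingly.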