Generating Attribute-Aware Human Motions from Textual Prompt
Xinghan Wang, Kun Xu, Fei Li, Cao Sheng, Jiazhong Yu, Yadong Mu
TL;DR
Existing text-to-motion methods ignore human attributes (e.g., age, gender) that shape biomechanics. This paper introduces AttrMoGen, a framework that combines a Semantic-Attribute Decoupling VQVAE with a Semantics Generative Transformer to produce attribute-aware motions. The framework decouples action semantics from attributes through a causal information bottleneck inspired by Structural Causal Models, with $CIB(X,Y,S,A) = I(X;S,A) + I(Y;S) - I(S;A) - \lambda I(X;S)$ guiding training. A new HumanAttr dataset with 18.2k motion sequences and attribute annotations enables benchmarking of attribute-controlled motion generation. In experiments, AttrMoGen outperforms state-of-the-art baselines on metrics such as FID and R-Precision, and ablations validate the importance of the entropy and bottleneck terms for decoupling attributes and aligning with text prompts. Counterfactual analyses illustrate the robustness of the disentanglement. The resulting realistic, attribute-consistent motions enable more context-aware animation and robotics applications.
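To make the structure of the CIB expression concrete, the following is a minimal sketch of how its four mutual-information terms combine. It is not the authors' implementation: the paper presumably optimizes learned estimates of these quantities, whereas this toy computes exact mutual information on small discrete tables. The function names `mutual_information` and `cib` are hypothetical, and reading $S$ as action semantics and $A$ as attributes is an assumption based on the surrounding text.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution.

    joint[i][j] is P(U=i, V=j); entries are non-negative and sum to 1.
    """
    row = [sum(r) for r in joint]          # marginal P(U)
    col = [sum(c) for c in zip(*joint)]    # marginal P(V)
    mi = 0.0
    for i, r in enumerate(joint):
        for j, p in enumerate(r):
            if p > 0.0:                    # 0 * log 0 is treated as 0
                mi += p * math.log(p / (row[i] * col[j]))
    return mi

def cib(i_x_sa, i_y_s, i_s_a, i_x_s, lam):
    """Combine the four terms with the signs given in the CIB formula."""
    return i_x_sa + i_y_s - i_s_a - lam * i_x_s

# Toy illustration of the decoupling term I(S;A) (assumed reading:
# S = semantics code, A = attribute label, both binary here).
independent = [[0.25, 0.25], [0.25, 0.25]]  # S carries no information about A
coupled = [[0.5, 0.0], [0.0, 0.5]]          # S fully determines A

penalty_independent = mutual_information(independent)
penalty_coupled = mutual_information(coupled)
```

In this toy setting the independent table incurs no $I(S;A)$ penalty, while the fully coupled table incurs the maximum for a binary variable, which is the behavior the $-I(S;A)$ term rewards when semantics and attributes are disentangled.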
Abstract
Text-driven human motion generation has recently attracted considerable attention, allowing models to generate human motions based on textual descriptions. However, current methods neglect the influence of human attributes (such as age, gender, weight, and height), which are key factors shaping human motion patterns. This work is a pilot exploration toward bridging this gap. We conceptualize each motion as comprising both attribute information and action semantics, where textual descriptions align exclusively with action semantics. To achieve this, we propose a new framework inspired by Structural Causal Models that decouples action semantics from human attributes, enabling text-to-semantics prediction and attribute-controlled generation. The resulting model can generate attribute-aware motion aligned with the user's text and attribute inputs. For evaluation, we introduce a comprehensive dataset containing attribute annotations for text-motion pairs, setting the first benchmark for attribute-aware motion generation. Extensive experiments validate our model's effectiveness.
