Generating Attribute-Aware Human Motions from Textual Prompt

Xinghan Wang, Kun Xu, Fei Li, Cao Sheng, Jiazhong Yu, Yadong Mu

TL;DR

Existing text-to-motion generators ignore human attributes (e.g., age, gender) that shape the biomechanics of movement. The paper introduces AttrMoGen, a framework that pairs a Semantic-Attribute Decoupling VQVAE (Decoup-VQVAE) with a Semantics Generative Transformer to produce attribute-aware motions. Inspired by Structural Causal Models, it decouples action semantics from attributes with a causal information bottleneck, whose objective $CIB(X,Y,S,A) = I(X;S,A) + I(Y;S) - I(S;A) - \lambda I(X;S)$ guides training. A new HumanAttr dataset with 18.2k motion sequences and attribute annotations enables benchmarking of attribute-controlled motion generation. In experiments, AttrMoGen outperforms state-of-the-art baselines on metrics such as $\text{FID}$ and $\text{R-Precision}$, and ablations validate the importance of the entropy and bottleneck terms for decoupling attributes and aligning with text prompts. Counterfactual analyses illustrate the robustness of the disentanglement, and the resulting realistic, attribute-consistent motions support more context-aware animation and robotics applications.
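To make the causal information bottleneck expression concrete, the following minimal sketch evaluates it numerically. It only combines the four mutual-information terms exactly as written in the formula above; how the paper estimates each term during training, and whether the quantity is maximized or minimized, is not stated in this summary. Figure 3's caption identifies $X$ as raw motion and $S$ as action semantics; reading $A$ as human attributes and $Y$ as the text-side variable is an assumption. The function names `mutual_information` and `cib` are hypothetical helpers for illustration, not part of any released code.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """Exact I(U;V) in nats for a discrete joint distribution.

    `joint` maps (u, v) pairs to probabilities that sum to 1.
    """
    p_u = defaultdict(float)
    p_v = defaultdict(float)
    for (u, v), p in joint.items():
        p_u[u] += p
        p_v[v] += p
    # Sum p(u,v) * log(p(u,v) / (p(u) p(v))) over cells with non-zero mass.
    return sum(
        p * math.log(p / (p_u[u] * p_v[v]))
        for (u, v), p in joint.items()
        if p > 0
    )


def cib(i_x_sa, i_y_s, i_s_a, i_x_s, lam):
    """Combine the four terms as in the TL;DR formula:
    CIB = I(X;S,A) + I(Y;S) - I(S;A) - lam * I(X;S).
    """
    return i_x_sa + i_y_s - i_s_a - lam * i_x_s


if __name__ == "__main__":
    # Toy check of the I(S;A) term: independent S and A carry no shared information.
    independent = {(s, a): 0.25 for s in (0, 1) for a in (0, 1)}
    print("I(S;A), independent:", mutual_information(independent))

    # Illustrative (made-up) term values, only to show how the terms combine.
    print("CIB:", cib(i_x_sa=1.0, i_y_s=0.5, i_s_a=0.2, i_x_s=0.8, lam=0.5))
```

In this reading, the $-I(S;A)$ term is the one that penalizes attribute information leaking into the semantic code, which is why the toy check uses an independent joint distribution for $S$ and $A$.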

Abstract

Text-driven human motion generation has recently attracted considerable attention, allowing models to generate human motions based on textual descriptions. However, current methods neglect the influence of human attributes, such as age, gender, weight, and height, which are key factors shaping human motion patterns. This work represents a pilot exploration for bridging this gap. We conceptualize each motion as comprising both attribute information and action semantics, where textual descriptions align exclusively with action semantics. To achieve this, a new framework inspired by Structural Causal Models is proposed to decouple action semantics from human attributes, enabling text-to-semantics prediction and attribute-controlled generation. The resulting model is capable of generating attribute-aware motion aligned with the user's text and attribute inputs. For evaluation, we introduce a comprehensive dataset containing attribute annotations for text-motion pairs, setting the first benchmark for attribute-aware motion generation. Extensive experiments validate our model's effectiveness.

Paper Structure

This paper contains 20 sections, 17 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Data samples from the HumanAttr dataset with text prompts and human attributes. Notice that the motion patterns of subjects with different attributes vary significantly.
  • Figure 2: Statistics of age, gender, motion duration (in seconds), and text length (in words) of the HumanAttr dataset.
  • Figure 3: Structural Causal Model for our Decoup-VQVAE. Our objective is to learn an encoder capable of decoupling the $Y$-causative action semantics $S$ from raw motion $X$.
  • Figure 4: Overall architecture of our proposed AttrMoGen. The encoder of Decoup-VQVAE uses a causal information bottleneck to decouple action semantics from human attributes, producing attribute-free semantic tokens. The decoder then reconstructs motion from these semantic tokens and attribute labels. The Semantics Generative Transformer predicts semantic tokens from textual input, which are subsequently combined with attribute inputs to generate attribute-aware human motions during inference.
  • Figure 5: Visualization of motions generated by MoMask and AttrMoGen. As shown, subjects with different attributes exhibit variations in the extent and patterns of movements.
  • ...and 4 more figures
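
The inference path described in the Figure 4 caption (text to semantic tokens, then tokens plus attributes to motion) can be sketched as a two-stage data flow. This is a toy illustration under stated assumptions, not the authors' implementation: every name here (`Attributes`, `predict_semantic_tokens`, `decode_motion`, `generate`, `CODEBOOK_SIZE`) is hypothetical, and both stages are replaced by trivial stand-ins so the example runs without any model weights. The only property it aims to show is structural: the token stage sees the text alone, and attributes enter only at the decoder.

```python
from dataclasses import dataclass
from typing import List

CODEBOOK_SIZE = 8  # hypothetical; the real codebook size is not given in this summary


@dataclass(frozen=True)
class Attributes:
    """Hypothetical container for the attribute inputs (a subset, for brevity)."""
    age: int
    gender: str


def predict_semantic_tokens(text: str) -> List[int]:
    """Stand-in for the Semantics Generative Transformer.

    Maps each word to a codebook index with a toy checksum; it depends on
    the text only, mirroring the attribute-free semantic tokens.
    """
    return [sum(ord(c) for c in word) % CODEBOOK_SIZE for word in text.lower().split()]


def decode_motion(tokens: List[int], attrs: Attributes) -> List[float]:
    """Stand-in for the Decoup-VQVAE decoder: tokens + attributes -> toy 1-D signal.

    The amplitude rule below is arbitrary and purely illustrative; it is not
    a claim about real biomechanics or about the paper's decoder.
    """
    amplitude = 1.0 if attrs.age < 60 else 0.5
    return [amplitude * t for t in tokens]


def generate(text: str, attrs: Attributes) -> List[float]:
    tokens = predict_semantic_tokens(text)  # stage 1: text only
    return decode_motion(tokens, attrs)     # stage 2: attributes enter here


if __name__ == "__main__":
    prompt = "a person walks"
    print(generate(prompt, Attributes(age=25, gender="female")))
    print(generate(prompt, Attributes(age=70, gender="female")))
```

Because the tokens are computed before the attributes are consulted, swapping the attribute input while holding the prompt fixed changes only the decoded output, which is the kind of counterfactual comparison the TL;DR mentions.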