SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation

Nikos Athanasiou, Mathis Petrovich, Michael J. Black, Gül Varol

TL;DR

This work addresses the challenge of generating 3D human motions that realize multiple simultaneous actions described by text, a problem referred to as spatial composition. It introduces SINC, a TEMOS-based text-to-motion model trained with GPT-3-guided synthetic data: GPT-3 maps each action to the body parts it involves, and compatible motions are stitched together to form realistic concurrent actions. The authors present a GPT-driven pipeline for body-part labeling, synthesize new compositional data, and demonstrate improvements over baselines on the BABEL dataset using a TEMOS-based evaluation framework, including a novel TEMOS score. The approach mitigates data scarcity for compositional actions and supports free-form input prompts, with code released for research use. It advances fine-grained, multi-action motion synthesis, with potential applications in animation and interactive AI systems.

Abstract

Our goal is to synthesize 3D human motions given textual inputs describing simultaneous actions, for example 'waving hand' while 'walking' at the same time. We refer to generating such simultaneous movements as performing 'spatial compositions'. In contrast to temporal compositions that seek to transition from one action to another, spatial compositing requires understanding which body parts are involved in which action, to be able to move them simultaneously. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as "what are the body parts involved in the action <action name>?", while also providing the parts list and few-shot examples. Given this action-part mapping, we combine body parts from two motions together and establish the first automated method to spatially compose two actions. However, training data with compositional actions is always limited by the combinatorics. Hence, we further create synthetic data with this approach, and use it to train a new state-of-the-art text-to-motion generation model, called SINC ("SImultaneous actioN Compositions for 3D human motions"). Our experiments show that training with such GPT-guided synthetic data improves spatial composition generation over baselines. Our code is publicly available at https://sinc.is.tue.mpg.de/.
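
The prompting step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the part list `BODY_PARTS`, the "Q:/A:" layout, and the helpers `build_prompt` and `parse_parts` are assumptions made for this example, and the call to GPT-3 itself is left out. Only the question wording is taken from the abstract.

```python
# Hypothetical body-part list; the list used in the paper may differ.
BODY_PARTS = ["left arm", "right arm", "left leg", "right leg", "torso", "head"]

# Question wording quoted from the abstract.
QUESTION = "what are the body parts involved in the action {action}?"


def build_prompt(action, few_shot):
    """Build a few-shot prompt: part list, solved examples, then the open question.

    few_shot is a list of (action, [parts]) pairs (hypothetical examples).
    """
    lines = ["The list of body parts is: " + ", ".join(BODY_PARTS) + "."]
    for ex_action, ex_parts in few_shot:
        lines.append("Q: " + QUESTION.format(action=ex_action))
        lines.append("A: " + ", ".join(ex_parts))
    # The last question is given without an answer, for the model to complete.
    lines.append("Q: " + QUESTION.format(action=action))
    lines.append("A:")
    return "\n".join(lines)


def parse_parts(answer):
    """Minimal post-processing: keep every known part name found in the answer."""
    text = answer.lower()
    return {part for part in BODY_PARTS if part in text}


prompt = build_prompt("wave hand", [("walk", ["left leg", "right leg"])])
parts = parse_parts("Right arm and torso.")
```

The returned set of parts is what a later stitching step would consume; matching against a fixed list keeps the post-processing robust to free-form answers.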

Paper Structure (25 sections, 2 equations, 9 figures, 10 tables)

This paper contains 25 sections, 2 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Goal: We demonstrate the task of spatial compositions in human motion synthesis. We generate 3D motions for a pair of actions, defined by a pair of textual descriptions. Here, we provide six sample input-output illustrations from our model. For example, we input the set of actions {'put hands on the waist', 'move torso left'} and generate one motion that simultaneously performs both.
  • Figure 2: GPT-guided synthetic training data creation: We illustrate our procedure to generate Synth-Pairs. Here, we combine two motion sequences from the training set with the corresponding labels 'stroll' and 'raise arms'. We first prompt GPT-3 with the instructions, few-shot examples containing question-answer pairs, and giving the action of interest in the last question without the answer. We minimally post-process the output of GPT-3 to assign this action to a set of body parts. The relevant body parts from each motion are then stitched together to form a new synthetically composited motion.
  • Figure 3: Model architecture: We extend TEMOS (Petrovich et al., 2022) such that it is trained with compositional actions. We build multiple descriptions given two action labels, by adding words such as 'while', 'during', etc. We then randomly sample one version during training as input to the text encoder.
  • Figure 4: Single-action GPT-compositing vs SINC: We show two examples that highlight the advantage of our model compared to GPT compositions. Top: The detected body parts overlap, causing the stitching to generate a forward movement. Bottom: The global orientation is taken from the 'walk forwards' motion, so the composition fails to generate a left turn.
  • Figure 5: Qualitative analysis: (a) We present qualitative results for our final model, SINC, for various description pairs from the validation set. Our generations correctly correspond to the input semantics even when they are different from the ground truth, highlighting the challenge of coordinate-based (positional) performance measures. We display the ground truth (GT) for reference to define what the given actions mean. (b) We compare different models on two simultaneous action pairs. Both the Single-action model and the model not trained on synthetic data fail to generate those two compositions. Our model trained with the synthetic data successfully generates the composition in both cases. We include more comparisons in the supplementary video on our project page.
  • ...and 4 more figures
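
The stitching in Figure 2 and the description sampling in Figure 3 can be illustrated with a small sketch. Everything here is an assumption made for the example: the joint layout `PART_TO_JOINTS`, the `(frames, joints, 3)` array shape, and the helper names `stitch` and `describe` are hypothetical. The sketch also omits the compatibility checks and global-orientation handling that the Figure 4 caption shows to matter.

```python
import random

import numpy as np

# Hypothetical joint layout: joint indices belonging to each body part.
PART_TO_JOINTS = {
    "torso": [0, 1],
    "left arm": [2, 3],
    "right arm": [4, 5],
    "left leg": [6, 7],
    "right leg": [8, 9],
}

# Connecting words named in the Figure 3 caption.
CONNECTORS = ["while", "during"]


def stitch(motion_a, motion_b, parts_b):
    """Start from motion_a and overwrite the joints of parts_b with motion_b.

    Both motions are arrays of shape (frames, joints, 3); the result is
    trimmed to the shorter of the two sequences.
    """
    n = min(len(motion_a), len(motion_b))
    out = motion_a[:n].copy()
    for part in parts_b:
        idx = PART_TO_JOINTS[part]
        out[:, idx] = motion_b[:n, idx]
    return out


def describe(label_a, label_b, rng):
    """Sample one textual description for the pair of action labels."""
    return f"{label_a} {rng.choice(CONNECTORS)} {label_b}"


stroll = np.zeros((4, 10, 3))      # placeholder motion for 'stroll'
raise_arms = np.ones((6, 10, 3))   # placeholder motion for 'raise arms'
combined = stitch(stroll, raise_arms, {"left arm", "right arm"})
text = describe("stroll", "raise arms", random.Random(0))
```

Here the arm joints come from the second motion and all other joints from the first, which mirrors the 'stroll' plus 'raise arms' example in the Figure 2 caption.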