SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation
Nikos Athanasiou, Mathis Petrovich, Michael J. Black, Gül Varol
TL;DR
This work addresses the challenge of generating 3D human motions that realize multiple simultaneous actions described by text, a problem referred to as spatial composition. It introduces SINC, a TEMOS-based text-to-motion model trained with GPT-3-guided synthetic data: GPT-3 maps actions to body parts, and compatible motions are stitched together to produce realistic concurrent actions. The authors present a GPT-driven pipeline for body-part labeling, synthesize new compositional data, and demonstrate improvements over baselines on the BABEL dataset using a TEMOS-based evaluation framework, including a novel TEMOS score. The approach mitigates the scarcity of data for compositional actions and supports free-form input prompts, and the code is released for research use. It advances fine-grained, multi-action motion synthesis, with potential applications in animation and interactive AI systems.
Abstract
Our goal is to synthesize 3D human motions given textual inputs describing simultaneous actions, for example 'waving hand' while 'walking' at the same time. We refer to generating such simultaneous movements as performing 'spatial compositions'. In contrast to temporal compositions that seek to transition from one action to another, spatial compositing requires understanding which body parts are involved in which action, to be able to move them simultaneously. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as "what are the body parts involved in the action <action name>?", while also providing the parts list and few-shot examples. Given this action-part mapping, we combine body parts from two motions together and establish the first automated method to spatially compose two actions. However, training data with compositional actions is always limited by the combinatorics. Hence, we further create synthetic data with this approach, and use it to train a new state-of-the-art text-to-motion generation model, called SINC ("SImultaneous actioN Compositions for 3D human motions"). Our experiments show that training with such GPT-guided synthetic data improves spatial composition generation over baselines. Our code is publicly available at https://sinc.is.tue.mpg.de/.
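The pipeline described in the abstract (ask a language model which body parts an action uses, then combine parts from two motions) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The body-part list, the prompt layout, the toy motion format (one dictionary of per-part values per frame), and the function names `build_prompt`, `parse_parts`, `compatible`, and `stitch` are all assumptions made for illustration. The language-model call itself is left out: the answer string is written by hand.

```python
# Hypothetical coarse body-part list; the list used in the paper may differ.
BODY_PARTS = ("left arm", "right arm", "left leg", "right leg", "torso")


def build_prompt(action, few_shot=()):
    """Build a few-shot prompt around the question quoted in the abstract."""
    lines = ["The body parts are: " + ", ".join(BODY_PARTS) + "."]
    for ex_action, ex_parts in few_shot:
        lines.append(f"Q: What are the body parts involved in the action {ex_action}?")
        lines.append("A: " + ", ".join(ex_parts))
    lines.append(f"Q: What are the body parts involved in the action {action}?")
    lines.append("A:")
    return "\n".join(lines)


def parse_parts(answer):
    """Keep only the known part names that appear in a free-form answer."""
    text = answer.lower()
    return {part for part in BODY_PARTS if part in text}


def compatible(parts_a, parts_b):
    """Assumed rule: two actions can be composed if they use disjoint parts."""
    return parts_a.isdisjoint(parts_b)


def stitch(motion_a, motion_b, parts_b):
    """Take the parts in parts_b from motion_b and every other part from motion_a.

    Each motion is a list of frames; each frame maps a part name to a pose
    value. The result is trimmed to the shorter of the two motions.
    """
    n_frames = min(len(motion_a), len(motion_b))
    return [
        {
            part: (motion_b[t][part] if part in parts_b else motion_a[t][part])
            for part in BODY_PARTS
        }
        for t in range(n_frames)
    ]


# Toy motions: each pose value is just a (label, frame index) tuple.
walk = [{part: ("walk", t) for part in BODY_PARTS} for t in range(4)]
wave = [{part: ("wave", t) for part in BODY_PARTS} for t in range(3)]

prompt = build_prompt("wave", few_shot=[("walk", ["left leg", "right leg"])])
walk_parts = parse_parts("left leg, right leg")  # hand-written stand-in answers
wave_parts = parse_parts("Right arm")

if compatible(walk_parts, wave_parts):
    composed = stitch(walk, wave, wave_parts)
```

In this toy setup the composed motion keeps the walking values for the legs, torso, and left arm and takes only the right arm from the waving motion. A real system would operate on joint rotations or positions and would need to handle details this sketch ignores, such as global translation and actions that compete for the same body part.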
