Generation of Complex 3D Human Motion by Temporal and Spatial Composition of Diffusion Models

Lorenzo Mandelli; Stefano Berretti

TL;DR

This paper addresses the challenge of generating realistic 3D human motions for action classes that were never seen during training. It decomposes complex actions into simpler movements by leveraging the knowledge of human motion contained in GPT models.

Abstract

In this paper, we address the challenge of generating realistic 3D human motions for action classes that were never seen during the training phase. Our approach decomposes complex actions into simpler movements, specifically those observed during training, by leveraging the knowledge of human motion contained in GPT models. These simpler movements are then combined into a single, realistic animation using the properties of diffusion models. Our claim is that this decomposition and subsequent recombination of simple movements can synthesize an animation that accurately represents the complex input action. The method operates during the inference phase and can be integrated with any pre-trained diffusion model, enabling the synthesis of motion classes not present in the training data. We evaluate our method by dividing two benchmark human motion datasets into basic and complex actions, and then compare its performance against the state of the art.

Paper Structure (19 sections, 4 equations, 6 figures, 4 tables)

Introduction
Related work
Proposed Method
Diffusion model
MCD
Unknown action decomposition
Composition of submovements
Experiments
Datasets
Splits between base and complex actions
Metrics
Evaluation
Single text vs. composition
STMC vs. composition
Multi-annotations approach
...and 4 more sections

Figures (6)

Figure 1: The proposed 3D human motion generation pipeline during the inference phase. It is divided into two stages: first, an input textual annotation, which was not seen during the training phase, is divided into simple actions through a GPT decomposition module. Then, each simple action contributes at each step of the denoising diffusion process to generate the final motion.
Figure 2: (Left) The GPT module for annotation decomposition is initialized using the information provided in the instructions, decomposition examples, and known actions available in the training set. Its goal is to decompose any input motion annotation into sub-movement annotations present in the training set, following the provided examples, and to assign temporal boundaries to each of them. (Right) An input motion annotation, $C_{\text{input}}$, characterized by a textual description ("A person shoot a basketball") and a duration interval (start at $time = 0$, end at $time = 10$), is decomposed into a series of sub-movement annotations $C_1, C_2, \ldots, C_n$ through a call based on OpenAI's GPT. We note that generated sub-movements can connect either temporally (the green with the blue and red ones) or spatially (the blue and red sub-movements).
Figure 3: (Left) Sampling algorithm based on input decomposition: the unknown input motion, $C_{\text{input}}$, is decomposed into $n$ submovements $[C_1, C_2, \ldots, C_n]$, which collectively condition the generative process to reconstruct the original motion. (Right) Denoising composition over $n$ submovements: at each denoising step, the $i$-th submovement is generated conditioned on the corresponding $C_i$. All the generated motions are then merged, according to their respective duration intervals, into a single motion.
Figure 4: Comparison between text-only conditioned generation (a) and our decomposition-based approach (b): Text-only conditioning tends to produce stationary animations and fails to execute any meaningful actions if the textual annotations conditioning the movement do not fall within the training distribution. In contrast, by decomposing movements into sub-movements, we are able to successfully generate even complex actions.
Figure 5: Examples of original textual annotations (upper white box), corresponding decomposed sub-movement annotations (middle box), and the generated motion conditioned on the sub-movement annotations (lower human figure).
...and 1 more figure
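
The sampling procedure described in the captions of Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `SubMovement`, `merge_by_interval`, `toy_denoiser`, and `sample` are hypothetical, intervals are assumed to be expressed in frames, and the merge rule (averaging the predictions on frames where intervals overlap) is an assumption, since the captions only state that motions are merged according to their duration intervals. The toy denoiser is a placeholder for a pre-trained text-conditioned diffusion model.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SubMovement:
    """One decomposed annotation C_i with its duration interval (in frames)."""
    text: str
    start: int  # first frame, inclusive
    end: int    # last frame, exclusive


def merge_by_interval(predictions, subs, num_frames, fallback):
    """Paste each prediction into its interval; average where intervals overlap.

    Averaging is an assumed merge rule, chosen only for illustration.
    Frames covered by no sub-movement keep the values in `fallback`.
    """
    total = np.zeros_like(fallback)
    count = np.zeros((num_frames, 1))
    for pred, sub in zip(predictions, subs):
        total[sub.start:sub.end] += pred
        count[sub.start:sub.end] += 1
    covered = count[:, 0] > 0
    merged = fallback.copy()
    merged[covered] = total[covered] / count[covered]
    return merged


def toy_denoiser(x, text, step):
    """Hypothetical stand-in for a pre-trained model: pulls x toward len(text)."""
    target = float(len(text))
    return x + (target - x) / (step + 1)


def sample(subs, num_frames, feat_dim, steps, denoiser, seed=0):
    """At every denoising step, denoise each sub-movement, then merge."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_frames, feat_dim))
    for step in reversed(range(steps)):
        preds = [denoiser(x[s.start:s.end], s.text, step) for s in subs]
        x = merge_by_interval(preds, subs, num_frames, fallback=x)
    return x


# Two sub-movements that follow each other in time and overlap on frames 4-5.
subs = [SubMovement("aa", 0, 6), SubMovement("bbbb", 4, 10)]
motion = sample(subs, num_frames=12, feat_dim=3, steps=5, denoiser=toy_denoiser)
```

In this sketch the merge happens inside the denoising loop rather than once at the end, so every sub-movement sees the partially denoised result of its neighbours at the next step, which mirrors the per-step composition that Figure 3 describes.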
