Generation of Complex 3D Human Motion by Temporal and Spatial Composition of Diffusion Models

Lorenzo Mandelli, Stefano Berretti

TL;DR

This paper addresses the challenge of generating realistic 3D human motions for action classes never seen during training. It decomposes complex actions into simpler movements by leveraging the knowledge of human motion contained in GPT models.

Abstract

In this paper, we address the challenge of generating realistic 3D human motions for action classes that were never seen during the training phase. Our approach involves decomposing complex actions into simpler movements, specifically those observed during training, by leveraging the knowledge of human motion contained in GPT models. These simpler movements are then combined into a single, realistic animation using the properties of diffusion models. Our claim is that this decomposition and subsequent recombination of simple movements can synthesize an animation that accurately represents the complex input action. This method operates during the inference phase and can be integrated with any pre-trained diffusion model, enabling the synthesis of motion classes not present in the training data. We evaluate our method by dividing two benchmark human motion datasets into basic and complex actions, and then compare its performance against the state of the art.

Paper Structure (19 sections, 4 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: The proposed 3D human motion generation pipeline during the inference phase. It is divided into two stages: first, an input textual annotation, which was not seen during the training phase, is divided into simple actions through a GPT decomposition module. Then, each simple action contributes at each step of the denoising diffusion process to generate the final motion.
  • Figure 2: (Left) The GPT module for annotation decomposition is initialized using the information provided in the instructions, decomposition examples, and known actions available in the training set. Its goal is to decompose any input motion annotation into sub-movement annotations present in the training set, following the provided examples, and to assign temporal boundaries to each of them. (Right) An input motion annotation, $C_{\text{input}}$, characterized by a textual description ("A person shoot a basketball") and a duration interval (start at $time = 0$, end at $time = 10$), is decomposed into a series of sub-movement annotations $C_1, C_2, \ldots, C_n$ through a call based on OpenAI's GPT. We note that generated sub-movements can connect either temporally (the green with the blue and red ones) or spatially (the blue and red sub-movements).
  • Figure 3: (Left) Sampling algorithm based on input decomposition: the unknown input motion, $C_{\text{input}}$, is decomposed into $n$ sub-movements $[C_1, C_2, \ldots, C_n]$, which collectively condition the generative process to reconstruct the original motion. (Right) Denoising composition over $n$ sub-movements: at each denoising step, the $i$-th sub-movement is generated conditioned on the corresponding $C_i$. All the generated motions are then merged into a single motion according to their respective duration intervals.
  • Figure 4: Comparison between text-only conditioned generation (a) and our decomposition-based approach (b): Text-only conditioning tends to produce stationary animations and fails to execute any meaningful actions if the textual annotations conditioning the movement do not fall within the training distribution. In contrast, by decomposing movements into sub-movements, we are able to successfully generate even complex actions.
  • Figure 5: Examples of original textual annotations (upper white box), corresponding decomposed sub-movement annotations (middle box), and the generated motion conditioned on the sub-movement annotations (lower human figure).
  • ...and 1 more figure
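
The denoising composition described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: `SubMovement`, `toy_denoiser`, `composed_step`, and `sample` are hypothetical names, intervals are expressed in frames, and the pre-trained diffusion model is replaced by a toy function. The merge rule used here, averaging the predictions of sub-movements that overlap on the same frames, is an assumption; the source only states that motions are merged according to their duration intervals.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SubMovement:
    """Hypothetical sub-movement annotation C_i: a text plus a frame interval."""
    text: str
    start: int  # first frame, inclusive
    end: int    # last frame, exclusive


def toy_denoiser(x, text, step):
    """Stand-in for a pre-trained text-conditioned denoiser (assumption).

    It pulls the motion halfway toward a constant derived from the text,
    only so that the example is runnable without a real model.
    """
    target = float(len(text))
    return x + 0.5 * (target - x)


def composed_step(x, subs, step, denoiser=toy_denoiser):
    """One denoising step composed over all sub-movements.

    Each sub-movement denoises only the frames in its own interval.
    Frames covered by several sub-movements are averaged (assumed rule);
    frames covered by none are left unchanged.
    """
    total = np.zeros_like(x)
    count = np.zeros((x.shape[0], 1))
    for s in subs:
        total[s.start:s.end] += denoiser(x[s.start:s.end], s.text, step)
        count[s.start:s.end] += 1
    covered = count[:, 0] > 0
    out = x.copy()
    out[covered] = total[covered] / count[covered]
    return out


def sample(subs, n_frames, n_feats, n_steps=10, seed=0):
    """Start from Gaussian noise and apply the composed step repeatedly."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_frames, n_feats))
    for step in reversed(range(n_steps)):
        x = composed_step(x, subs, step)
    return x


# Example: one sub-movement followed in time by two that overlap each other.
subs = [
    SubMovement("bend the knees", 0, 4),
    SubMovement("jump upward", 4, 10),
    SubMovement("raise both arms", 4, 10),
]
motion = sample(subs, n_frames=10, n_feats=3)
```

In this sketch the first sub-movement connects temporally to the other two, while the last two share the same interval and are combined on the same frames, mirroring the temporal and spatial connections mentioned in the Figure 2 caption.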