
Massively Multi-Person 3D Human Motion Forecasting with Scene Context

Felix B Mueller, Julian Tanke, Juergen Gall

TL;DR

A scene-aware social transformer model (SAST) for forecasting long-term (10 s) human motion. It models interactions among widely varying numbers of both people and objects in a scene, and models the conditional motion distribution with denoising diffusion models.

Abstract

Forecasting long-term 3D human motion is challenging: the stochasticity of human behavior makes it hard to generate realistic human motion from the input sequence alone. Information on the scene environment and the motion of nearby people can greatly aid the generation process. We propose a scene-aware social transformer model (SAST) to forecast long-term (10 s) human motion. Unlike previous models, our approach can model interactions among widely varying numbers of both people and objects in a scene. We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information. We model the conditional motion distribution using denoising diffusion models. We benchmark our approach on the Humans in Kitchens dataset, which contains 1 to 16 persons and 29 to 50 objects that are visible simultaneously. Our model outperforms other approaches in terms of realism and diversity on different metrics and in a user study. Code is available at https://github.com/felixbmuller/SAST.
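
As a rough illustration of the diffusion component mentioned in the abstract, the sketch below shows the standard forward (noising) process of a denoising diffusion model, applied to a flat list of numbers standing in for motion features. It is not taken from the SAST code: the linear noise schedule, its endpoints, the number of steps, and the function names `linear_beta_schedule`, `alpha_bars`, and `q_sample` are all assumptions made for this example.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, not the paper's)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), one value per diffusion step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) and return the noise used."""
    scale = math.sqrt(abar[t])
    sigma = math.sqrt(1.0 - abar[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [scale * x + sigma * e for x, e in zip(x0, noise)]
    return xt, noise


if __name__ == "__main__":
    rng = random.Random(0)
    abar = alpha_bars(linear_beta_schedule(1000))
    clean = [0.5, -0.2, 1.0]  # hypothetical motion features
    noisy, eps = q_sample(clean, 999, abar, rng)
    print(noisy)
```

A denoising model would be trained to undo this corruption given the conditioning information (here, past motion and scene context); how SAST parameterizes that model is described in the paper, not in this sketch.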

Paper Structure (39 sections, 18 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 2: Architecture of the denoising model $f_\theta$.
  • Figure 3: The realism scoring model calculates a score based on short single-person motion snippets. It is trained to distinguish real from synthetic motion; the latter is generated using our model and the baseline models.
  • Figure 4: Visualization of ten-second output trajectories $r^{n:N}$ for each model. The last frame of the input sequences is normalized to $(0,0)$ with persons facing in the positive $y$-direction. For each model, 20 randomly selected trajectories are displayed.
  • Figure 5: Frame-wise mean global velocity for all models. We calculate the velocity of the hip center in the x- and y-direction and average over all evaluation samples. Outliers in the ground truth data (few single-frame velocities over 10 m/s) are clipped before averaging.
  • Figure 6: Samples of Ours creating diverse motion based on a fixed input. In the input sequence, a person starts to stand up. a) In the ground truth, the person kneels on the sofa to write on the whiteboard. b-d) Ours predicts writing on the whiteboard twice, once stepping on and once stepping over the sofa. The third prediction shows a hesitant standing-up motion.
  • ...and 2 more figures
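
The frame-wise mean global velocity described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes velocity is taken as the planar speed of the hip center from finite differences between consecutive frames, that positions are in metres, and that clipping caps speeds at 10 m/s. The function name `framewise_mean_speed` and the `fps` parameter are hypothetical.

```python
import math


def framewise_mean_speed(hip_xy, fps, clip=10.0):
    """Mean hip-center speed per frame transition, averaged over samples.

    hip_xy: list of samples, each a list of (x, y) hip positions in metres.
    fps:    frame rate of the sequences (assumed known by the caller).
    clip:   speeds above this value (m/s) are clipped before averaging.
    """
    num_steps = min(len(sample) for sample in hip_xy) - 1
    means = []
    for t in range(num_steps):
        speeds = []
        for sample in hip_xy:
            (x0, y0), (x1, y1) = sample[t], sample[t + 1]
            # Finite-difference speed in the x-y plane, in metres per second.
            speed = math.hypot(x1 - x0, y1 - y0) * fps
            speeds.append(min(speed, clip))
        means.append(sum(speeds) / len(speeds))
    return means


if __name__ == "__main__":
    samples = [
        [(0.0, 0.0), (0.1, 0.0), (0.1, 0.2)],  # slow walk
        [(0.0, 0.0), (0.0, 5.0), (0.0, 5.0)],  # single-frame outlier
    ]
    print(framewise_mean_speed(samples, fps=10))
```

Clipping before averaging keeps a handful of tracking glitches from dominating the per-frame mean, which is the motivation given in the caption for the ground truth data.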