Scene-aware Human Motion Forecasting via Mutual Distance Prediction

Chaoyue Xing, Wei Mao, Miaomiao Liu

TL;DR

The paper tackles scene-aware 3D human motion forecasting by introducing a mutual distance representation that jointly constrains local body pose and global motion. It couples per-vertex signed distances to the scene with per-basis point distances to the human, and leverages a global SDF-based scene representation to maintain coherence between global and local scene cues. The approach predicts mutual distances in a first stage using DCT and GCN encodings, then forecasts future poses with an autoregressive RNN, training with consistency losses derived from the SDF and basis distances. Through evaluations on four datasets, the method consistently outperforms state-of-the-art baselines, with ablations demonstrating the importance of both distance terms and the SDF-based scene representation, indicating stronger, more plausible interactions between humans and scenes. These advances have practical implications for robotics, animation, and VR/AR where realistic, scene-consistent human motion is essential.
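The DCT encoding mentioned above can be illustrated with a small NumPy sketch. It assumes the DCT is applied along the time axis of each distance trajectory, as is common in motion forecasting; the sequence length `T`, the number of kept coefficients `K`, and the toy trajectories are illustrative choices, not values from the paper.

```python
import numpy as np

def dct_basis(n):
    """Orthonormal DCT-II matrix C of shape (n, n); C @ x yields the coefficients of x."""
    k = np.arange(n)[:, None]   # frequency index
    t = np.arange(n)[None, :]   # time index
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)        # rescale the constant row so that C @ C.T == I
    return C

T = 30                          # hypothetical number of frames
C = dct_basis(T)

# Two toy distance trajectories over time (one column per vertex or basis point).
t = np.linspace(0.0, 1.0, T)
dist = np.stack([0.5 + 0.3 * t, 1.0 - 0.8 * t**2], axis=1)   # shape (T, 2)

coeffs = C @ dist               # temporal DCT coefficients, shape (T, 2)

# Keeping only the K lowest frequencies gives a compact, smooth approximation.
K = 10
recon = C[:K].T @ coeffs[:K]    # shape (T, 2)
err = np.abs(recon - dist).max()
print("max truncation error:", err)
```

Because the basis is orthonormal, the inverse transform is simply the transpose, so a network can predict coefficients and map them back to per-frame distances with a single matrix product.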

Abstract

In this paper, we tackle the problem of scene-aware 3D human motion forecasting. A key challenge of this task is to predict future human motions that are consistent with the scene by modeling the human-scene interactions. While recent works have demonstrated that explicit constraints on human-scene interactions can prevent the occurrence of ghost motion, they only constrain part of the human motion, e.g., the global motion of the human or a few joints contacting the scene, leaving the rest of the motion unconstrained. To address this limitation, we propose to model the human-scene interaction with the mutual distance between the human body and the scene. Such mutual distances constrain both the local and global human motion, so that the whole-body motion of the prediction is constrained. In particular, the mutual distance constraints consist of two components: the signed distance of each vertex on the human mesh to the scene surface, and the distance of basis scene points to the human mesh. We further introduce a global scene representation learned from a signed distance function (SDF) volume to ensure coherence between the global scene representation and the explicit constraint from the mutual distance. We develop a pipeline with two sequential steps: predicting the future mutual distances first, followed by forecasting future human motion. During training, we explicitly encourage consistency between predicted poses and mutual distances. Extensive evaluations on existing synthetic and real datasets demonstrate that our approach consistently outperforms state-of-the-art methods.
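The two components of the mutual distance can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the SDF volume is queried by nearest-voxel lookup, the distance from a basis point to the human mesh is approximated by the distance to the nearest mesh vertex, and the function names, grid layout, and toy floor scene are all hypothetical.

```python
import numpy as np

def per_vertex_signed_distance(vertices, sdf, origin, voxel_size):
    """Signed distance of each body vertex to the scene, read from an SDF volume.

    Assumption: nearest-voxel lookup; vertices outside the grid are clamped to its border.
    """
    idx = np.rint((vertices - origin) / voxel_size).astype(int)
    idx = np.clip(idx, 0, np.array(sdf.shape) - 1)
    return sdf[idx[:, 0], idx[:, 1], idx[:, 2]]

def per_basis_point_distance(basis_points, vertices):
    """Distance from each basis point to the body.

    Assumption: point-to-mesh distance is approximated by the nearest-vertex distance.
    """
    diff = basis_points[:, None, :] - vertices[None, :, :]   # (num_basis, num_vertices, 3)
    return np.linalg.norm(diff, axis=-1).min(axis=1)

# Toy scene: a flat floor at z = 0, so the signed distance of a point is its z coordinate.
n, voxel_size = 16, 0.25
origin = np.array([0.0, 0.0, -1.0])
z = origin[2] + voxel_size * np.arange(n)
sdf = np.broadcast_to(z[None, None, :], (n, n, n)).copy()

# Three "body" vertices (the last one penetrates the floor) and two basis points.
vertices = np.array([[1.0, 1.0, 0.5], [1.0, 1.5, 1.0], [2.0, 1.0, -0.25]])
basis_points = np.array([[1.0, 1.0, 0.0], [3.0, 3.0, 0.0]])

d = per_vertex_signed_distance(vertices, sdf, origin, voxel_size)
b = per_basis_point_distance(basis_points, vertices)
print(d)  # one signed distance per vertex; negative means below the floor
print(b)  # one unsigned distance per basis point
```

The per-vertex term describes how the body sits relative to the scene surface, including penetration through its sign, while the per-basis term describes where the body is as seen from fixed scene locations, which is how the pair can constrain both local pose and global motion.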

Paper Structure

This paper contains 11 sections, 15 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Our mutual distances. a) shows the per-vertex signed distance for sampled vertices on human mesh. Their color indicates the distance value. b) shows the per-basis point distance for the basis points which are not sampled on the scene surface. For both figures, the darker the color is, the smaller the distance is.
  • Figure 2: Network architecture. Given the 3D scene $\mathbf{S}$ represented as a signed distance volume and the past motion $\mathbf{X}$ shown in grey meshes, our approach first predicts the future per-vertex signed distance $\hat{\mathbf{D}}$ and the future per-basis point distance $\hat{\mathbf{B}}$ from the historical distances $\mathbf{D}$ and $\mathbf{B}$, respectively. The two predicted future distances are then fed into an RNN-based network to predict the future motion $\hat{\mathbf{Y}}$ shown in orange meshes.
  • Figure 3: Comparison of our method with baseline models on GTA-IM [cao2020long] (top row), PROX [hassan2019resolving] (middle row), and HUMANISE [wang2022humanise] (bottom row). Our method predicts future motion closer to the ground truth.
  • Figure 4: Ablation of the mutual distance. The first three columns (from left to right) depict the ground truth pose for the last frame of future motion, the results of our full model, and the predictions of our model without the mutual distance constraint. The sub-figure in gray is the last observed frame. Other sub-figures depict the predicted middle frame. The blue dot is the scene basis point, and the red dot is the sampled vertex on the human mesh. The two graphs in the last two columns show the predicted per-vertex signed distance for the red point and the predicted per-basis point distance for the blue point, respectively. As shown in the figures, with the predicted mutual distance, we can forecast the 'stand-up' action more accurately.