Scene-aware Human Motion Forecasting via Mutual Distance Prediction
Chaoyue Xing, Wei Mao, Miaomiao Liu
TL;DR
The paper tackles scene-aware 3D human motion forecasting by introducing a mutual distance representation that jointly constrains local body pose and global motion. The representation couples per-vertex signed distances from the human mesh to the scene with per-basis-point distances from the scene to the human, and it is paired with a global scene representation learned from an SDF volume so that global and local scene cues stay coherent. The approach works in two stages: it first predicts future mutual distances using DCT and GCN encodings, then forecasts future poses with an autoregressive RNN, training with consistency losses between the predicted poses and the predicted signed and basis-point distances. In evaluations on four datasets, the method consistently outperforms state-of-the-art baselines, and ablations show that both distance terms and the SDF-based scene representation contribute, which points to more plausible human-scene interactions. These advances are relevant to robotics, animation, and VR/AR, where realistic, scene-consistent human motion is essential.
Abstract
In this paper, we tackle the problem of scene-aware 3D human motion forecasting. A key challenge of this task is to predict future human motions that are consistent with the scene by modeling the human-scene interactions. While recent works have demonstrated that explicit constraints on human-scene interactions can prevent the occurrence of ghost motion, they constrain only part of the human motion, e.g., the global motion of the human or a few joints contacting the scene, leaving the rest of the motion unconstrained. To address this limitation, we propose to model the human-scene interaction with the mutual distance between the human body and the scene. Such mutual distances constrain both the local and global human motion, so the prediction is constrained over the whole body. In particular, the mutual distance constraints consist of two components: the signed distance from each vertex on the human mesh to the scene surface, and the distance from each basis scene point to the human mesh. We further introduce a global scene representation learned from a signed distance function (SDF) volume to ensure coherence between the global scene representation and the explicit constraint from the mutual distance. We develop a pipeline with two sequential steps: first predicting the future mutual distances, then forecasting the future human motion. During training, we explicitly encourage consistency between the predicted poses and the predicted mutual distances. Extensive evaluations on existing synthetic and real datasets demonstrate that our approach consistently outperforms state-of-the-art methods.
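To make the two components of the mutual distance concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the scene is given as a dense SDF volume on a regular grid, reads the per-vertex signed distance with a nearest-voxel lookup (a real system would more likely interpolate), and approximates the distance from a basis point to the human mesh by the distance to its closest mesh vertex. The function names, the grid parameters `origin` and `voxel_size`, and the toy scene are all hypothetical.

```python
import numpy as np


def vertex_signed_distances(vertices, sdf, origin, voxel_size):
    """Signed distance of each human vertex to the scene.

    Assumption: nearest-voxel lookup in a dense SDF grid whose voxel
    (i, j, k) is centred at origin + voxel_size * (i, j, k).
    """
    idx = np.rint((vertices - origin) / voxel_size).astype(int)
    # Clamp to the grid so vertices outside the volume read the border value.
    idx = np.clip(idx, 0, np.array(sdf.shape) - 1)
    return sdf[idx[:, 0], idx[:, 1], idx[:, 2]]


def basis_point_distances(basis_points, vertices):
    """Distance of each scene basis point to the human.

    Assumption: the closest mesh vertex stands in for the closest point
    on the mesh surface.
    """
    diff = basis_points[:, None, :] - vertices[None, :, :]  # (B, V, 3)
    return np.linalg.norm(diff, axis=-1).min(axis=1)        # (B,)


# Toy scene: a floor at z = 0, so the SDF is simply the height z.
n, voxel_size = 21, 0.1
origin = np.zeros(3)
z = np.arange(n) * voxel_size
sdf = np.broadcast_to(z[None, None, :], (n, n, n)).copy()

# Three "human" vertices stacked above the point (1, 1) on the floor.
vertices = np.array([[1.0, 1.0, 0.0],
                     [1.0, 1.0, 0.5],
                     [1.0, 1.0, 1.0]])
# Two basis points on the floor: one under the human, one 1 m away.
basis_points = np.array([[1.0, 1.0, 0.0],
                         [1.0, 2.0, 0.0]])

d_vertex = vertex_signed_distances(vertices, sdf, origin, voxel_size)
d_basis = basis_point_distances(basis_points, vertices)
```

In this toy case the per-vertex term records how high each body vertex is above the floor, while the basis-point term records how far each fixed scene location is from the body. The first describes the local pose relative to nearby surfaces and the second describes where the body is in the scene, which is why the two together constrain both local and global motion.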
