Multimodal Sense-Informed Prediction of 3D Human Motions

Zhenyu Lou; Qiongjie Cui; Haofan Wang; Xu Tang; Hong Zhou

TL;DR

The paper tackles the prediction of future 3D human motion in realistic environments by conditioning on two modalities: the external 3D scene and the internal human gaze. It introduces SIF3D, a multimodal framework that encodes the observed motion with a MotionEncoder and the scene point cloud with PointNet++, then applies two cross-modal attention mechanisms, ternary intention-aware attention ($\text{TIA}$) and semantic coherence-aware attention ($\text{SCA}$), to predict the trajectory $\boldsymbol{T}$, orientation $\boldsymbol{O}$, and pose $\boldsymbol{P}$. A MotionDecoder produces the final sequence under the supervision of a geometry discriminator. The architecture supports end-to-end learning for long-horizon prediction and explicitly models salient scene interactions and human intention. Evaluations on GIMO and GTA-1M report state-of-the-art performance in both trajectory deviation and MPJPE, supporting the benefit of scene salience and gaze-guided planning for realistic 3D motion generation. The approach is relevant to robot planning and human-robot collaboration in real-world settings, and it suggests future work on richer modalities and scalable scene representations.

Abstract

Predicting future human pose is a fundamental application for machine intelligence: it lets robots plan their behavior and paths ahead of time so that human-robot collaboration in real-world 3D scenarios proceeds seamlessly. Despite encouraging results, existing approaches rarely consider the effect of the external scene on the motion sequence, which leads to pronounced artifacts and physical implausibilities in the predictions. To address this limitation, this work introduces a novel multimodal sense-informed motion prediction approach that conditions high-fidelity generation on two modalities, the external 3D scene and the internal human gaze, and recognizes their salience for future human activity. The gaze information is regarded as the human intention; combined with the motion and scene features, it is used to construct a ternary intention-aware attention that supervises the generation to match where the human wants to reach. In addition, we introduce a semantic coherence-aware attention that explicitly distinguishes the salient points of the scene point cloud from the underlying ones, to ensure a reasonable interaction of the generated sequence with the 3D scene. On two real-world benchmarks, the proposed method achieves state-of-the-art performance in both 3D human pose and trajectory prediction.
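
The two quantities this page reports, trajectory deviation and MPJPE (mean per-joint position error), can be illustrated with a short NumPy sketch. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the array layouts `(T, J, 3)` and `(T, 3)`, the function names, and the choice to average over all frames are assumptions, and the page does not say whether poses are root-aligned before the error is computed.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (T, J, 3) holding J joint positions
    for each of T frames (assumed layout).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Euclidean distance per joint and frame, then the mean over both.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def trajectory_deviation(pred_traj, gt_traj):
    """Mean Euclidean distance between predicted and true root positions.

    pred_traj, gt_traj: arrays of shape (T, 3) (assumed layout).
    """
    pred_traj = np.asarray(pred_traj, dtype=float)
    gt_traj = np.asarray(gt_traj, dtype=float)
    return float(np.linalg.norm(pred_traj - gt_traj, axis=-1).mean())


# Toy example: every joint is displaced by (3, 4, 0), so each error is 5.
gt_pose = np.zeros((2, 3, 3))
pred_pose = gt_pose + np.array([3.0, 4.0, 0.0])
pose_error = mpjpe(pred_pose, gt_pose)

# Toy example: every root position is displaced by 1 along z.
gt_root = np.zeros((2, 3))
pred_root = gt_root + np.array([0.0, 0.0, 1.0])
traj_error = trajectory_deviation(pred_root, gt_root)
```

Both functions return a value in the same length unit as the inputs, so errors are only comparable across methods when the coordinate convention and units match.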

Paper Structure (16 sections, 13 equations, 3 figures, 6 tables)

Introduction
Related Work
Proposed Method
Problem Setup
Multimodal Encoding
Ternary Intention-Aware Attention
Semantic Coherence-Aware Attention
Motion Sequence Generation
Implement Details
Experiments
Experimental Setup
Ablation of Multiple Modalities
Detailed Results
Visualizations
Ablation Studies
...and 1 more section

Figures (3)

Figure 1: The proposed SIF3D: multimodal Sense-Informed Forecasting of 3D human motions. SIF3D takes the observed motion sequence and the 3D scene point cloud as input modalities and identifies salient points (redder) and underlying ones (bluer) to generate an accurate trajectory and high-fidelity future poses within a given 3D scene. In contrast, the state-of-the-art baseline BiFu (Zheng et al., 2022) treats the global scene embedding uniformly and therefore cannot distinguish the saliency of the 3D scene, which leads to physically implausible motions, e.g., human meshes that intersect with or are distorted by the 3D environment in violation of physical constraints.
Figure 2: The architecture of SIF3D. SIF3D incorporates three input modalities: the past motion sequence, the 3D scene point cloud, and the human gaze. First, the MotionEncoder encodes the past motion sequence into a motion embedding $\boldsymbol{f}_m$, and the 3D scene $\boldsymbol{S}$ is encoded into $\{\boldsymbol{\hat{S}}, \boldsymbol{\hat{S}}_{global}\}$ through PointNet++ (Qi et al., 2017). Then, the TIA mechanism compresses the motion embedding along the temporal dimension and searches for globally salient points in the scene for trajectory planning. In addition, the human gaze point $\boldsymbol{G}$ is used to index the scene point cloud and extract the scene feature at the gaze point. The SCA mechanism, in turn, captures locally salient points in the scene for each individual pose. A TrajectoryPlanner and a PosePredictor predict the trajectory and the poses, respectively. Finally, the predicted motion sequence is generated by a MotionDecoder, which is supervised by the geometric discriminator.
Figure 3: Visualizations of SIF3D compared with the state-of-the-art BiFu in two scenarios: (a) living room and (b) bedroom. The top row shows the results of BiFu (Zheng et al., 2022), which treats all scene points equally; the middle row shows SIF3D, where the salient points are highlighted in red and the underlying points are in blue. For clarity, the predicted sequence is presented from the vertical view, the whole-sequence view, and the end-pose view. The red human meshes are the ground truth, while the blue ones are the predictions. The bottom row presents the local scene salience heatmap of SIF3D across time, with a time interval of 2 seconds.
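
The data flow described in the Figure 2 caption can be sketched with plain scaled dot-product cross-attention in NumPy. This is a hedged illustration only, not the authors' TIA or SCA modules: the mean pooling used for temporal compression, the additive fusion of the gaze feature, the nearest-point gaze lookup, the shared feature width `d`, and all function names are assumptions, and random arrays stand in for the MotionEncoder and PointNet++ outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Scaled dot-product attention.

    queries: (Q, d), keys: (N, d), values: (N, d_v).
    Returns the attended features (Q, d_v) and the weights (Q, N).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


def gaze_feature(points, point_feats, gaze_point):
    """Return the feature of the scene point nearest to the gaze point."""
    dists = np.linalg.norm(points - gaze_point, axis=-1)
    idx = int(np.argmin(dists))
    return point_feats[idx], idx


rng = np.random.default_rng(0)
T, N, d = 4, 16, 8                       # frames, scene points, feature width
f_m = rng.normal(size=(T, d))            # stand-in for the motion embedding
points = rng.uniform(-1, 1, size=(N, 3)) # stand-in scene point coordinates
S_hat = rng.normal(size=(N, d))          # stand-in per-point scene features
G = points[5].copy()                     # gaze point placed on scene point 5

# TIA-like step (assumed form): pool motion over time, add the gaze feature,
# and attend over all scene points to get one global salience map.
g_feat, g_idx = gaze_feature(points, S_hat, G)
intent_query = f_m.mean(axis=0, keepdims=True) + g_feat
_, global_salience = cross_attention(intent_query, S_hat, S_hat)

# SCA-like step (assumed form): one query per frame, giving a local
# salience map over scene points for each pose.
fused, local_salience = cross_attention(f_m, S_hat, S_hat)
```

Here `global_salience` has shape `(1, N)` and `local_salience` has shape `(T, N)`; each row sums to one and can be read as a per-point salience map of the kind the figures color from blue to red.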
