Expressive Forecasting of 3D Whole-body Human Motions

Pengxiang Ding, Qiongjie Cui, Min Zhang, Mengyuan Liu, Haofan Wang, Donglin Wang

TL;DR

This work tackles expressive forecasting of 3D whole-body motions by jointly predicting body joints and hand gestures. It introduces the Encoding-Alignment-Interaction (EAI) framework, comprising Cross-context Alignment (XCA) to reduce heterogeneity across body parts and Cross-context Interaction (XCI) to model intra-body interactivity via cross-attention and wrist-based fusion. The method leverages DCT-based intra-context encoding and a composite loss that balances pose accuracy, gesture alignment, bone-length consistency, and distribution alignment. Evaluations on the GRAB/SMPL-X dataset demonstrate state-of-the-art performance for both short- and long-term horizons, with ablations confirming the importance of XCA and XCI. The approach advances expressive motion forecasting with practical implications for human-robot interaction and related tasks.
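As a rough illustration of what DCT-based encoding of a motion sequence means, the sketch below transforms the trajectory of a single joint coordinate into orthonormal DCT-II coefficients and back. It is a minimal, stdlib-only example and not the authors' implementation: the function names (`dct_basis`, `dct_encode`, `dct_decode`) and the optional truncation parameter `n_coeffs` are assumptions made for illustration.

```python
import math


def dct_basis(n_frames):
    """Orthonormal DCT-II basis: one row per frequency, one column per frame."""
    basis = []
    for k in range(n_frames):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n_frames)
        basis.append([
            scale * math.cos(math.pi * (2 * t + 1) * k / (2 * n_frames))
            for t in range(n_frames)
        ])
    return basis


def dct_encode(trajectory, n_coeffs=None):
    """Project a 1D trajectory onto the DCT basis.

    n_coeffs (hypothetical parameter) keeps only the lowest frequencies,
    which smooths the trajectory when it is decoded again.
    """
    n_frames = len(trajectory)
    basis = dct_basis(n_frames)
    keep = n_frames if n_coeffs is None else n_coeffs
    return [sum(b * x for b, x in zip(basis[k], trajectory)) for k in range(keep)]


def dct_decode(coeffs, n_frames):
    """Inverse transform: rebuild the trajectory from (possibly truncated) coefficients."""
    basis = dct_basis(n_frames)
    return [
        sum(coeffs[k] * basis[k][t] for k in range(len(coeffs)))
        for t in range(n_frames)
    ]


# One coordinate of one joint over five observed frames (made-up values).
trajectory = [0.0, 0.1, 0.3, 0.6, 1.0]
coeffs = dct_encode(trajectory)
reconstructed = dct_decode(coeffs, len(trajectory))
```

With all coefficients kept, the round trip reproduces the input up to floating-point error. Dropping high-frequency coefficients gives a smoother approximation, which is one common reason for working in DCT space in motion forecasting.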

Abstract

Human motion forecasting, with the goal of estimating future human behavior over a period of time, is a fundamental task in many real-world applications. However, existing works typically concentrate on predicting the major joints of the human body without considering the delicate movements of the human hands. In practical applications, hand gesture plays an important role in human communication with the real world, and expresses the primary intention of human beings. In this work, we are the first to formulate a whole-body human pose forecasting task, which jointly predicts the future body and hand activities. Correspondingly, we propose a novel Encoding-Alignment-Interaction (EAI) framework that aims to predict both coarse (body joints) and fine-grained (gestures) activities collaboratively, enabling expressive and cross-facilitated forecasting of 3D whole-body human motions. Specifically, our model involves two key constituents: cross-context alignment (XCA) and cross-context interaction (XCI). Considering the heterogeneous information within the whole-body, XCA aims to align the latent features of various human components, while XCI focuses on effectively capturing the context interaction among the human components. We conduct extensive experiments on a newly-introduced large-scale benchmark and achieve state-of-the-art performance. The code is public for research purposes at https://github.com/Dingpx/EAI.

Paper Structure

This paper contains 11 sections, 14 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Top: Previous works focus on predicting the major joints of the human body, without considering the delicate hand movements that are critical to HRI applications. Bottom: To fill this gap, our work proposes a novel task, whole-body human pose forecasting, which jointly predicts both future body and gesture activities. We also highlight that within the proposed EAI, the coarse-grained (major joints) and fine-grained (gestures) properties are cross-facilitated to achieve a higher-fidelity prediction. Here, the red/blue pose is the predicted result, while the underlying green pose is the ground truth.
  • Figure 2: Overall framework of Encoding-Alignment-Interaction (EAI). Given the observed whole-body sequences $\{\mathbf{X}_{l},\mathbf{X}_{m},\mathbf{X}_{r}\}$, we first obtain the heterogeneous features $\{\mathbf{S}_{l},\mathbf{S}_{m},\mathbf{S}_{r}\}$ via intra-context encoding, applied to each body component independently. Since these intra-context features lack information about the interaction between components, cross-context alignment (XCA) and cross-context interaction (XCI) are then applied to extract cross-context information. XCA alleviates the heterogeneity of the components to generate the homogeneous features $\{\widetilde{\mathbf{S}}_{l},\widetilde{\mathbf{S}}_{m},\widetilde{\mathbf{S}}_{r}\}$, and XCI explores the cross-context interaction based on these homogeneous features. The resulting expressive features $\{\widetilde{\mathbf{F}}_{l},\widetilde{\mathbf{F}}_{m},\widetilde{\mathbf{F}}_{r}\}$ are then used to predict the future whole-body sequences $\{\hat{\mathbf{Y}}_{l},\hat{\mathbf{Y}}_{m},\hat{\mathbf{Y}}_{r}\}$.
  • Figure 3: Based on $\{\mathbf{S}_{l},\mathbf{S}_{m},\mathbf{S}_{r}\}$, XCA applies circular cross neutralization and a discrepancy constraint (MMD) to alleviate the heterogeneity across components and generate the homogeneous features.
  • Figure 4: Taking $\{\widetilde{\mathbf{S}}_{l},\widetilde{\mathbf{S}}_{m},\widetilde{\mathbf{S}}_{r}\}$ as input, XCI explores the pairwise interactivity of different parts through both the semantic and the physical interaction within the whole body.
  • Figure 5: Visualization of predicted whole-body poses (skeletons). The past sequence is shown in a grey box, and the predicted sequences are shown in yellow boxes. The ground-truth and predicted poses are drawn as green and blue/red skeletons, respectively. As highlighted by the dashed ellipses, performance on both coarse-grained (body) and fine-grained (gesture) motion is considered. This is evidence that it is indeed beneficial to first eliminate the heterogeneity of the different human components and then extract the interaction within the whole body.
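The discrepancy constraint mentioned in the Figure 3 caption is the maximum mean discrepancy (MMD), a distance between two feature distributions. The sketch below computes a biased estimate of the squared MMD between two small sets of feature vectors. It is an illustrative assumption rather than the paper's formulation: the RBF kernel, the `bandwidth` parameter, and the toy feature values are all hypothetical, since the captions do not specify which kernel or estimator is used.

```python
import math


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""
    def mean_kernel(p, q):
        total = sum(rbf_kernel(a, b, bandwidth) for a in p for b in q)
        return total / (len(p) * len(q))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy latent features for two body components (made-up values).
body_features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3]]
hand_features_near = [[0.1, 0.1], [0.2, 0.2], [0.0, 0.2]]
hand_features_far = [[3.0, 3.1], [3.2, 2.9], [2.8, 3.0]]

near_gap = mmd_squared(body_features, hand_features_near)
far_gap = mmd_squared(body_features, hand_features_far)
```

A value close to zero means the two feature sets are hard to tell apart under the chosen kernel, so minimizing such a term during training pushes the features of different components toward a shared distribution.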