Towards more realistic human motion prediction with attention to motion coordination

Pengxiang Ding; Jianqin Yin

TL;DR

This work tackles the realism gap in human motion prediction by explicitly modeling global motion coordination alongside local joint interactions. It introduces a coordination attractor (CA)-based Comprehensive Joint Relation Extractor (CJRE) and a Multi-timescale Dynamics Extractor (MTDE), which together capture global joint coordination and enriched intra-joint dynamics. Within CJRE, the Global Coordination Extractor (GCE) and the Local Interaction Extractor (LIE) model global and local relations simultaneously, and the Adaptive Feature Fusing Module (AFFM) fuses their features. The framework lowers MPJPE on H3.6M, CMU-Mocap, and 3DPW for both short- and long-term horizons. Experimental results, including ablations and qualitative visualizations, indicate that it produces more realistic, coordinated motions, with practical relevance for robotics, animation, and perception systems.

Abstract

Joint relation modeling is a crucial component in human motion prediction. Most existing methods rely on skeletal-based graphs to build the joint relations, where local interactive relations between joint pairs are well learned. However, the motion coordination, a global joint relation reflecting the simultaneous cooperation of all joints, is usually weakened because it is learned from part to whole progressively and asynchronously. Thus, the final predicted motions usually appear unrealistic. To tackle this issue, we learn a medium, called coordination attractor (CA), from the spatiotemporal features of motion to characterize the global motion features, which is subsequently used to build new relative joint relations. Through the CA, all joints are related simultaneously, and thus the motion coordination of all joints can be better learned. Based on this, we further propose a novel joint relation modeling module, Comprehensive Joint Relation Extractor (CJRE), to combine this motion coordination with the local interactions between joint pairs in a unified manner. We also present a Multi-timescale Dynamics Extractor (MTDE) to extract enriched dynamics from the raw position information for effective prediction. Extensive experiments show that the proposed framework outperforms state-of-the-art methods in both short- and long-term predictions on H3.6M, CMU-Mocap, and 3DPW.
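
The TL;DR reports results as MPJPE (mean per-joint position error), which is the Euclidean distance between predicted and ground-truth joint positions, averaged over joints and frames. The following is a minimal sketch of that metric, not the authors' evaluation code; the function name `mpjpe` and the `(frames, joints, 3)` array layout are assumptions made for illustration.

```python
import numpy as np

def mpjpe(pred, target):
    """Mean per-joint position error.

    pred, target: arrays of shape (frames, joints, 3) holding 3D joint
    positions (layout assumed for this sketch). The result is in the
    same unit as the input coordinates.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    # Euclidean distance per joint and frame, shape (frames, joints).
    per_joint = np.linalg.norm(pred - target, axis=-1)
    # Average over all joints and frames.
    return float(per_joint.mean())

# Toy example: every joint is displaced by the vector (3, 4, 0).
target = np.zeros((2, 3, 3))
pred = target + np.array([3.0, 4.0, 0.0])
error = mpjpe(pred, target)
```

Lower values are better; a perfect prediction gives an error of zero.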

Paper Structure (25 sections, 11 equations, 10 figures, 11 tables)

Introduction
Related work
Our Method
Problem formulation
Multi-timescale Dynamics Extractor (MTDE)
Global Coordination Extractor (GCE)
Definition of global motion trends
Feature Normalization Unit
Multi-head Self-attention Unit
Local Interaction Extractor (LIE)
Adaptive Feature Fusing Module (AFFM)
Loss Function
Experiments
Datasets and Implementation Details
Comparison with baselines
...and 10 more sections

Figures (10)

Figure 1: Qualitative results of short-term predictions of the motion "discussion" on H3.6M. From top to bottom, we show the ground truth, the results of LTD 08, ConvS2S 19, ResSup 17, and our approach. Compared with our approach, the motions predicted by the other works share the same problem: the limbs are uncoordinated, which makes the predicted motion appear unrealistic.
Figure 2: The left panel shows the overall proposed framework, and the two right panels show the details of MTDE and CJRE. Based on the two-stream architecture, the MTDE module extracts the enriched motion dynamics. The CJRE module encodes the global coordination of all joints and the local interactions between joint pairs through GCE and LIE, respectively. We denote ${I^{cjre\_i}}$ and ${O^{cjre\_i}}$ as the input and output of the $i$th CJRE module. AFFM is introduced to fuse features through a channel-wise attention mechanism. The whole CJRE is built on the bottleneck architecture of ResNet 27 for efficiency. In particular, lateral connections inspired by U-Net 28 supply fine-grained motion features. Finally, two $1\times 1$ convolutions are applied successively to transform the temporal and spatial dimensions and produce the final prediction.
Figure 3: Observation of motion. The motion of the head lasts four frames, while that of the left foot lasts ten frames.
Figure 4: The overall architecture of GCE. It mainly contains two parts. The Feature Normalization Unit is designed to extract relative joint motion representation without the interference of global motion trends for later global coordination relation modeling. The Multi-head Self-attention Unit is proposed to generate multiple relation graphs of joints to extract richer global coordination. (To simplify the representation, ${X}$ and ${F_{ca}}$ are used to represent ${X^{cjre\_i}}$ and $F_{ca}^{cjre\_i}$, respectively.)
Figure 5: The implementation of the Local Interaction Extractor (LIE). The left path uses a Non-local block without a residual connection to learn the relations between distant joint pairs. The right path uses convolutions to learn the relations between adjacent joint pairs. (To simplify the representation, ${X}$, ${F_{adjacent}}$ and ${F_{distant}}$ are used to represent ${X^{cjre\_i}}$, $F_{adjacent}^{cjre\_i}$ and $F_{distant}^{cjre\_i}$, respectively.)
...and 5 more figures
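
The Figure 4 caption describes GCE as two steps: a normalization that yields joint features relative to the global motion, followed by multi-head self-attention that produces several joint relation graphs. The sketch below illustrates that idea only and is not the authors' implementation. It assumes the coordination attractor can be stood in for by the mean feature over joints (the paper learns it from spatiotemporal features) and uses random, untrained projection weights. The name `global_coordination_sketch` and all parameters are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_coordination_sketch(x, num_heads=2, seed=0):
    """Relate all joints at once through a shared reference feature.

    x: (joints, channels) features for one sample (layout assumed).
    Returns the attended features, one (joints, joints) relation graph
    per head, and the stand-in attractor.
    """
    x = np.asarray(x, dtype=float)
    joints, channels = x.shape
    if channels % num_heads != 0:
        raise ValueError("channels must be divisible by num_heads")
    head_dim = channels // num_heads
    rng = np.random.default_rng(seed)

    # ASSUMPTION: the mean over joints stands in for the learned
    # coordination attractor.
    attractor = x.mean(axis=0, keepdims=True)
    # Relative joint features: the shared global component is removed.
    rel = x - attractor

    out = np.empty_like(rel)
    graphs = []
    for h in range(num_heads):
        # Random, untrained projections (illustration only).
        wq, wk, wv = (rng.standard_normal((channels, head_dim)) / np.sqrt(channels)
                      for _ in range(3))
        q, k, v = rel @ wq, rel @ wk, rel @ wv
        # Each row is one joint's attention over all joints.
        attn = softmax(q @ k.T / np.sqrt(head_dim), axis=-1)
        graphs.append(attn)
        out[:, h * head_dim:(h + 1) * head_dim] = attn @ v
    return out, np.stack(graphs), attractor

x = np.arange(20, dtype=float).reshape(5, 4)  # 5 joints, 4 channels
out, graphs, attractor = global_coordination_sketch(x)
# Adding the same offset to every joint leaves the output unchanged,
# because only features relative to the attractor are attended over.
out_shifted, _, _ = global_coordination_sketch(x + 10.0)
```

Because every joint is compared against the same reference before attention, all joints are related in a single step rather than progressively through a skeletal graph, which is the motivation the abstract gives for the coordination attractor.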
