Enhancing Human Motion Prediction via Multi-range Decoupling Decoding with Gating-adjusting Aggregation

Jiexin Wang, Wenwen Qiang, Zhao Yang, Bing Su

TL;DR

The paper addresses horizon-dependent temporal correlations in human motion prediction by introducing MD2GA, a two-stage framework that (i) decouples decoding across multiple future ranges with a Multi-range Decoupling Decoder (MDD) and (ii) fuses the horizon-specific predictions with a gating-adjusting aggregation (GA). The MDD uses $K$ decoders to produce outputs $Y_k$ at horizons $L_k$, while GA computes mixing weights $\boldsymbol{\text{w}}$ via a lightweight gating network and blends outputs with an attention mask $A_{k,t}$. The method is designed to be easily integrated with existing HMP models and is trained with a joint loss $\boldsymbol{\text{L}}=\boldsymbol{\text{L}}_1+\boldsymbol{\text{L}}_2$ that encourages horizon-specific decodings to align with a shared representation. Experiments on H3.6M, CMU-Mocap, and 3DPW show consistent MPJPE reductions across short- and long-term predictions, demonstrating improved motion representation learning and robustness across architectures. The approach offers practical benefits for real-world motion prediction systems due to its simplicity and wide compatibility.

Abstract

Expressive representation of pose sequences is crucial for accurate motion modeling in human motion prediction (HMP). While recent deep learning-based methods have shown promise in learning motion representations, they tend to overlook how the relevance of historical information varies across future moments: the correlation is stronger for short-term predictions and weaker for distant ones. This limits the learning of motion representations and, in turn, hampers prediction performance. In this paper, we propose a novel approach called multi-range decoupling decoding with gating-adjusting aggregation (MD2GA), which leverages these temporal correlations to refine motion representation learning. The approach employs a two-stage strategy for HMP. In the first stage, multi-range decoupling decoding adjusts feature learning by decoding the shared features into distinct future lengths, where different decoders offer diverse insights into motion patterns. In the second stage, a gating-adjusting aggregation dynamically combines these insights, guided by the input motion data. Extensive experiments demonstrate that the proposed method can be easily integrated into other motion prediction methods and enhances their prediction performance.
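The two-stage strategy can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear decoders, the gate computed from the shared feature, the binary coverage form of the mask $A_{k,t}$, and all function names (`decode_multi_range`, `horizon_mask`, `gated_aggregate`) are assumptions made for illustration. The joint loss is omitted because its terms are not specified in this summary.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def decode_multi_range(feat, decoder_weights, horizons, pose_dim):
    """Stage 1 (assumed linear decoders): decoder k maps the shared
    feature (B, F) to a prediction Y_k of shape (B, L_k, pose_dim)."""
    batch = feat.shape[0]
    return [(feat @ W).reshape(batch, L, pose_dim)
            for W, L in zip(decoder_weights, horizons)]


def horizon_mask(horizons, total_len):
    """Assumed form of A[k, t]: 1 if decoder k covers frame t, else 0."""
    frames = np.arange(total_len)[None, :]
    return (frames < np.asarray(horizons)[:, None]).astype(float)


def gated_aggregate(outputs, gate_logits, horizons):
    """Stage 2: blend the K predictions with gate weights w (B, K),
    restricted by the mask and renormalised per frame."""
    total_len = max(horizons)
    batch, _, pose_dim = outputs[0].shape
    mask = horizon_mask(horizons, total_len)           # (K, T)
    w = softmax(gate_logits, axis=-1)                  # (B, K)
    w = w[:, :, None] * mask[None, :, :]               # (B, K, T)
    w = w / w.sum(axis=1, keepdims=True)               # sums to 1 over k
    padded = np.zeros((batch, len(horizons), total_len, pose_dim))
    for k, (Y, L) in enumerate(zip(outputs, horizons)):
        padded[:, k, :L] = Y                           # zero-pad short ranges
    return np.einsum("bkt,bktd->btd", w, padded)       # (B, T, pose_dim)


# Hypothetical sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
feat_dim, pose_dim, horizons = 16, 6, [5, 10, 25]
decoder_weights = [rng.normal(scale=0.1, size=(feat_dim, L * pose_dim))
                   for L in horizons]
gate_weights = rng.normal(scale=0.1, size=(feat_dim, len(horizons)))

feat = rng.normal(size=(2, feat_dim))                  # shared feature M
outputs = decode_multi_range(feat, decoder_weights, horizons, pose_dim)
prediction = gated_aggregate(outputs, feat @ gate_weights, horizons)
print(prediction.shape)  # (2, 25, 6)
```

Because the longest-range decoder covers every frame, the per-frame renormalisation never divides by zero, and each output frame is a convex combination of the decoders that predict it.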

Paper Structure

This paper contains 21 sections, 13 equations, 11 figures, 15 tables.

Figures (11)

  • Figure 1: Performance across various future prediction horizons in toy experiments. Experiments are conducted on GCN-, LSTM-, and Transformer-based neural networks using 10 initial poses as input, with Pre-$x$ denoting a prediction sequence length of $x$ frames on the Human 3.6M dataset [ionescu2013human3]. The figure demonstrates the increasing difficulty of predicting farther into the future, as evidenced by performance degradation from the 1st to the 10th frame. Moreover, it shows that wider prediction horizons lead to poorer short-term prediction performance, as seen by comparing the shared prediction horizons across different Pre-$x$ settings.
  • Figure 2: Illustration of our multi-range decoupling decoding with gating-adjusting aggregation framework. Our approach expands the mainstream framework $\mathcal{F}_{\mathrm{pred}}$, which comprises an encoder $\varphi$ (e.g., graph neural networks) and a decoder $g$. We extend $g$ into multi-range decoupling decoding, which adaptively adjusts feature learning by decoding shared features into distinct future lengths, transitioning from the motion feature $M$ to $M'$. Furthermore, we propose a dynamic gating-adjusting aggregation mechanism to combine the diverse insights derived from the multi-range decoupling decoding.
  • Figure 3: t-SNE visualization of predicted human motion. Red represents the ground truth, and yellow depicts the motion features predicted by the model.
  • Figure 4: Ablation on the number of decoders. The average MPJPE of short-term motion prediction is reported.
  • Figure 5: Qualitative visualization of the adaptive attention across different actions in SPGSN with our method.
  • ...and 6 more figures
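
MPJPE (mean per-joint position error), the metric reported in the ablation and experiments, is the Euclidean distance between predicted and ground-truth joint positions, averaged over joints. A minimal sketch follows, assuming arrays shaped (batch, frames, joints, 3); the function names and the per-frame variant are illustrative, not taken from the paper's code.

```python
import numpy as np


def mpjpe(pred, target):
    """Mean Euclidean distance over all joints, frames, and samples."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.linalg.norm(pred - target, axis=-1).mean()


def mpjpe_per_frame(pred, target):
    """Same metric, but kept separate for each future frame (axis 1),
    which is how errors at individual horizons are usually compared."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.linalg.norm(pred - target, axis=-1).mean(axis=(0, 2))


# Every joint is off by the vector (3, 4, 0), i.e. by a distance of 5.
target = np.zeros((1, 2, 4, 3))
pred = target + np.array([3.0, 4.0, 0.0])
print(mpjpe(pred, target))  # 5.0
```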