SimpliHuMoN: Simplifying Human Motion Prediction

Aadya Agrawal; Alexander Schwing

TL;DR

This work proposes a simple, streamlined, end-to-end transformer-based model that is sufficiently versatile to handle pose-only, trajectory-only, and combined prediction tasks without task-specific modifications.

Abstract

Human motion prediction combines the tasks of trajectory forecasting and human pose prediction. For each of the two tasks, specialized models have been developed. Combining these models for holistic human motion prediction is non-trivial, and recent methods have struggled to compete on established benchmarks for individual tasks. To address this, we propose a simple yet effective transformer-based model for human motion prediction. The model employs a stack of self-attention modules to capture both spatial dependencies within a pose and temporal relationships across a motion sequence. This simple, streamlined, end-to-end model is sufficiently versatile to handle pose-only, trajectory-only, and combined prediction tasks without task-specific modifications. Through extensive experiments on a wide range of benchmark datasets, including Human3.6M, AMASS, ETH-UCY, and 3DPW, we demonstrate that this approach achieves state-of-the-art results across all tasks.
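
To make the idea of one attention stack covering both spatial and temporal structure concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is not the authors' implementation: the tokenization (one token per frame-joint pair), the sizes `T`, `J`, `D`, and the random weight matrices are all assumptions chosen for illustration.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
T, J, D = 4, 3, 8  # hypothetical: frames, joints, embedding width

# Assumed tokenization: one token per (frame, joint) pair, so a single
# attention map relates joints within a frame (spatial) and the same or
# different joints across frames (temporal).
tokens = rng.normal(size=(T * J, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

out, attn = self_attention(tokens, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (12, 8) (12, 12)
```

Under this assumed layout, entry `attn[i, j]` weights how much token `i` draws on token `j`, regardless of whether the two tokens differ in joint, in frame, or in both.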

Paper Structure

This paper contains 53 sections, 5 equations, 7 figures, and 13 tables.

Introduction
SimpliHuMoN
Overview of our method.
Input Processing and Embedding Module
Past Context Encoding
Future Query Generation
Transformer Decoder
Multi-Modal Prediction Heads
Implementation Details
Experiments
Datasets
Metrics
Baselines
Quantitative Results
Qualitative Results
...and 38 more sections

Figures (7)

Figure 1: An overview of our architecture. Past observations of 3D poses ($P_{\text{past}}$) and trajectories ($T_{\text{past}}$) are jointly processed by an encoder. Learnable input queries ($\mathcal{Q}_{\text{in}}$), representing potential future states, interact with the encoded past motion within a decoder to produce $K$ distinct future motion proposals ($X_{\text{fut}}^k$) for all agents over a specified horizon.
Figure 2: Visualization of predictions on a MOCAP-UMPM scene. Model predictions are in color, and ground-truth future poses are black dashes. The last-known input positions are colored dashes.
Figure 3: Visualization of motion proposals ($K=6$) of our (wide) model on MOCAP-UMPM data. All model predictions are in color. Ground-truth future poses are represented by black dashes, and the last-known input positions are colored dashes.
Figure 4: Distribution of winning mode indices (best-of-6) on pose + trajectory prediction task across the training and validation sets of MOCAP-UMPM. The dashed line (- - -) indicates equal distribution. Both distributions demonstrate balanced mode utilization without mode collapse.
Figure 5: Attention patterns in the first transformer block at epoch 100. Brighter colors indicate stronger attention weights. Dashed lines (- - -) separate past context from future query tokens.
...and 2 more figures
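
The best-of-6 statistic in the Figure 4 caption can be illustrated with a short sketch: for each sample, pick the proposal closest to the ground truth and count how often each mode index wins. This is only an illustration under stated assumptions. The error measure (mean L2 distance over the horizon), the array shapes, and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np

def winning_modes(proposals, target):
    """Return the index and error of the closest proposal for each sample.

    proposals: (N, K, T, D) array, K candidate futures per sample
    target:    (N, T, D) array, ground-truth futures
    """
    # Assumed error measure: L2 distance per time step, averaged over T.
    err = np.linalg.norm(proposals - target[:, None], axis=-1).mean(axis=-1)
    return err.argmin(axis=1), err.min(axis=1)

rng = np.random.default_rng(0)
N, K, T, D = 600, 6, 10, 3  # hypothetical sizes; K = 6 mirrors best-of-6

# Synthetic stand-in data: every proposal is the target plus noise.
target = rng.normal(size=(N, T, D))
proposals = target[:, None] + rng.normal(size=(N, K, T, D))

best_idx, best_err = winning_modes(proposals, target)
counts = np.bincount(best_idx, minlength=K)
print(counts / N)  # compare each share with the uniform value 1 / K
```

A histogram of `counts` that stays close to `1 / K` for every index corresponds to the balanced mode utilization the caption describes, while a single dominant index would indicate mode collapse.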
