Robust Human Trajectory Prediction via Self-Supervised Skeleton Representation Learning

Taishu Arashima; Hiroshi Kera; Kazuhiko Kawamoto

TL;DR

The paper proposes a robust trajectory prediction method that incorporates a self-supervised skeleton representation model pretrained with masked autoencoding. The method improves robustness to missing skeletal data without sacrificing prediction accuracy, and it consistently outperforms baseline models in clean-to-moderate missingness regimes.

Abstract

Human trajectory prediction plays a crucial role in applications such as autonomous navigation and video surveillance. While recent works have explored the integration of human skeleton sequences to complement trajectory information, skeleton data in real-world environments often suffer from missing joints caused by occlusions. Such missing joints significantly degrade prediction accuracy, indicating the need for more robust skeleton representations. We propose a robust trajectory prediction method that incorporates a self-supervised skeleton representation model pretrained with masked autoencoding. Experimental results in occlusion-prone scenarios show that our method improves robustness to missing skeletal data without sacrificing prediction accuracy, and that it consistently outperforms baseline models in clean-to-moderate missingness regimes.

Paper Structure (50 sections, 15 equations, 14 figures, 11 tables)

This paper contains 50 sections, 15 equations, 14 figures, 11 tables.

Introduction
Related Work
Proposed Method
Self-Supervised Skeleton Representation Learning
Skeleton Representation
Architecture
Masking Strategy
Reconstruction Loss
Integration into Human Trajectory Prediction
Integration Strategy
Overall Architecture
Loss Function
Experiments
Datasets
Self-Supervised Skeleton Representation Learning
...and 35 more sections

Figures (14)

Figure 1: Impact of occlusion-induced missing skeletons on trajectory prediction accuracy. The top row shows a clean (non-occluded) skeleton observation, while the bottom row shows an occluded observation with missing joints. Using an existing human trajectory prediction (HTP) model, prediction accuracy is high under clean inputs (FDE = 0.17 m) but degrades substantially under occluded inputs (FDE = 0.87 m), where FDE denotes the final displacement error.
Figure 2: Overview of the self-supervised skeleton learning framework. Random masks are applied to the input skeleton sequence, and a spatio-temporal encoder extracts latent representations from the remaining visible joints. A decoder then reconstructs the original skeleton sequence. Through this process, the model learns structural representations invariant to missing data and noise.
Figure 3: Illustration of three mask patterns used for self-supervised skeleton pretraining: (a) Temporally Consistent masks the same joints across all frames; (b) Random masks joints independently at each frame; (c) Body-Part masks multiple joints in the same body part together.
Figure 4: Overall architecture of the proposed framework. (a) Self-supervised pretraining: a skeleton encoder is pretrained by reconstructing masked joints from partially observed skeleton sequences. (b) Individual feature extraction: observed trajectories are embedded, and skeleton sequences are encoded by the pretrained (frozen) encoder. Positional encoding (PE) is added to each stream, the resulting tokens are concatenated, and a Cross-Modality Transformer produces an agent-wise representation. (c) Interaction modeling: the agent-wise representations are processed by a Social Transformer to model inter-agent interactions.
Figure 5: Reconstruction MPJPE under varying training and test-time mask ratios. (a) MPJPE versus the training mask ratio $r_{\mathrm{train}}$, with one curve for each test-time mask ratio $r_{\mathrm{test}}$. (b) Heatmap over $(r_{\mathrm{train}}, r_{\mathrm{test}})$, where darker colors indicate lower MPJPE.
...and 9 more figures
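
The three mask patterns described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration in NumPy, not the authors' implementation. It assumes a boolean mask of shape (frames, joints) where True marks a masked joint. The function names, the `PARTS` grouping of 17 joints, the choice of a fixed number of masked joints per frame, and the choice to hold the body-part mask fixed across frames are all hypothetical and are not stated in the captions.

```python
import numpy as np

# Hypothetical grouping of 17 joints into body parts (an assumption,
# not taken from the paper): head, left arm, right arm, left leg, right leg.
PARTS = [[0, 1, 2, 3, 4], [5, 7, 9], [6, 8, 10], [11, 13, 15], [12, 14, 16]]


def temporally_consistent_mask(num_frames, num_joints, ratio, rng):
    """Mask the same randomly chosen joints in every frame."""
    n_masked = int(round(ratio * num_joints))
    joints = rng.choice(num_joints, size=n_masked, replace=False)
    mask = np.zeros((num_frames, num_joints), dtype=bool)
    mask[:, joints] = True
    return mask


def random_mask(num_frames, num_joints, ratio, rng):
    """Mask joints independently at each frame.

    Assumption: exactly round(ratio * num_joints) joints are masked per frame.
    """
    n_masked = int(round(ratio * num_joints))
    mask = np.zeros((num_frames, num_joints), dtype=bool)
    for t in range(num_frames):
        joints = rng.choice(num_joints, size=n_masked, replace=False)
        mask[t, joints] = True
    return mask


def body_part_mask(num_frames, num_joints, parts, num_parts, rng):
    """Mask all joints of randomly chosen body parts together.

    Assumption: the chosen parts stay masked across all frames.
    """
    chosen = rng.choice(len(parts), size=num_parts, replace=False)
    mask = np.zeros((num_frames, num_joints), dtype=bool)
    for p in chosen:
        mask[:, parts[p]] = True
    return mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for name, m in [
        ("temporally consistent", temporally_consistent_mask(8, 17, 0.3, rng)),
        ("random", random_mask(8, 17, 0.3, rng)),
        ("body part", body_part_mask(8, 17, PARTS, 2, rng)),
    ]:
        print(name, "masked fraction:", m.mean())
```

In a masked-autoencoding setup of the kind outlined in the Figure 2 caption, such a mask would select which joints are hidden from the encoder, and the decoder would be trained to reconstruct them.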
