Robust Human Trajectory Prediction via Self-Supervised Skeleton Representation Learning

Taishu Arashima, Hiroshi Kera, Kazuhiko Kawamoto

TL;DR

This paper proposes a trajectory prediction method built on a self-supervised skeleton representation model pretrained with masked autoencoding. The method improves robustness to missing skeletal data without sacrificing prediction accuracy, and it consistently outperforms baseline models in clean-to-moderate missingness regimes.

Abstract

Human trajectory prediction plays a crucial role in applications such as autonomous navigation and video surveillance. While recent works have explored the integration of human skeleton sequences to complement trajectory information, skeleton data in real-world environments often suffer from missing joints caused by occlusions. These disturbances significantly degrade prediction accuracy, indicating the need for more robust skeleton representations. We propose a robust trajectory prediction method that incorporates a self-supervised skeleton representation model pretrained with masked autoencoding. Experimental results in occlusion-prone scenarios show that our method improves robustness to missing skeletal data without sacrificing prediction accuracy, and consistently outperforms baseline models in clean-to-moderate missingness regimes.

Paper Structure

This paper contains 50 sections, 15 equations, 14 figures, and 11 tables.

Figures (14)

  • Figure 1: Impact of occlusion-induced missing skeletons on trajectory prediction accuracy. The top row shows a clean (non-occluded) skeleton observation, while the bottom row shows an occluded observation with missing joints. Using an existing human trajectory prediction (HTP) model, prediction accuracy is high under clean inputs (FDE = 0.17 m) but degrades substantially under occluded inputs (FDE = 0.87 m), where FDE denotes the final displacement error.
  • Figure 2: Overview of the self-supervised skeleton learning framework. Random masks are applied to the input skeleton sequence, and a spatio-temporal encoder extracts latent representations from the remaining visible joints. A decoder then reconstructs the original skeleton sequence. Through this process, the model learns structural representations invariant to missing data and noise.
  • Figure 3: Illustration of three mask patterns used for self-supervised skeleton pretraining: (a) Temporally Consistent masks the same joints across all frames; (b) Random masks joints independently at each frame; (c) Body-Part masks multiple joints in the same body part together.
  • Figure 4: Overall architecture of the proposed framework. (a) Self-supervised pretraining: a skeleton encoder is pretrained by reconstructing masked joints from partially observed skeleton sequences. (b) Individual feature extraction: observed trajectories are embedded, and skeleton sequences are encoded by the pretrained (frozen) encoder. Positional encoding (PE) is added to each stream, the resulting tokens are concatenated, and a Cross-Modality Transformer produces an agent-wise representation. (c) Interaction modeling: the agent-wise representations are processed by a Social Transformer to model inter-agent interactions.
  • Figure 5: Reconstruction MPJPE under varying training and test-time mask ratios. (a) MPJPE versus the training mask ratio $r_{\mathrm{train}}$, with one curve for each test-time mask ratio $r_{\mathrm{test}}$. (b) Heatmap over $(r_{\mathrm{train}}, r_{\mathrm{test}})$, where darker colors indicate lower MPJPE.
  • ...and 9 more figures
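
The three mask patterns described in the Figure 3 caption can be sketched as boolean arrays of shape (frames, joints), where True marks a masked joint. This is a minimal illustration rather than the authors' implementation. The function names, the 17-joint layout, the `BODY_PARTS` grouping, the use of a fixed number of masked joints per frame, and the choice to hold the body-part mask fixed across frames are all assumptions made for the example.

```python
import numpy as np

# Hypothetical grouping of a 17-joint skeleton into body parts.
# The paper's actual joint layout and grouping are not given in this excerpt.
BODY_PARTS = {
    "head": [0, 1, 2, 3, 4],
    "left_arm": [5, 7, 9],
    "right_arm": [6, 8, 10],
    "left_leg": [11, 13, 15],
    "right_leg": [12, 14, 16],
}


def temporally_consistent_mask(num_frames, num_joints, ratio, rng):
    """Mask the same randomly chosen joints in every frame."""
    k = int(round(ratio * num_joints))
    joints = rng.choice(num_joints, size=k, replace=False)
    mask = np.zeros((num_frames, num_joints), dtype=bool)
    mask[:, joints] = True
    return mask


def random_mask(num_frames, num_joints, ratio, rng):
    """Mask joints independently at each frame (k joints per frame, assumed)."""
    k = int(round(ratio * num_joints))
    mask = np.zeros((num_frames, num_joints), dtype=bool)
    for t in range(num_frames):
        joints = rng.choice(num_joints, size=k, replace=False)
        mask[t, joints] = True
    return mask


def body_part_mask(num_frames, num_joints, num_parts, rng, parts=BODY_PARTS):
    """Mask all joints of randomly chosen body parts (same parts in every frame, assumed)."""
    names = list(parts)
    chosen = rng.choice(len(names), size=num_parts, replace=False)
    mask = np.zeros((num_frames, num_joints), dtype=bool)
    for i in chosen:
        mask[:, parts[names[i]]] = True
    return mask
```

During pretraining, a mask of this kind would select the joints hidden from the encoder, and the decoder would be trained to reconstruct them from the remaining visible joints.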
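
The two metrics named in the captions, final displacement error (FDE, Figure 1) and MPJPE (Figure 5), can be computed as below under their standard definitions. This is a sketch for a single agent and a single skeleton sequence. The function names and the optional `mask` argument are assumptions, and the excerpt does not say whether the paper averages MPJPE over all joints or only over masked ones.

```python
import numpy as np


def fde(pred, gt):
    """Final displacement error: Euclidean distance at the last predicted step.

    pred, gt: arrays of shape (num_steps, 2) holding positions in meters.
    """
    return float(np.linalg.norm(pred[-1] - gt[-1]))


def mpjpe(pred, gt, mask=None):
    """Mean per-joint position error between two skeleton sequences.

    pred, gt: arrays of shape (num_frames, num_joints, dim).
    mask: optional boolean array (num_frames, num_joints); if given, the
          error is averaged only over entries where mask is True (assumed option).
    """
    err = np.linalg.norm(pred - gt, axis=-1)  # per-joint error, shape (frames, joints)
    if mask is not None:
        return float(err[mask].mean())
    return float(err.mean())
```

For example, a prediction that ends at (0, 0) when the ground truth ends at (3, 4) has an FDE of 5 in the same units as the coordinates.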