Multi-Transmotion: Pre-trained Model for Human Motion Prediction

Yang Gao, Po-Chien Luan, Alexandre Alahi

TL;DR

This paper integrates multiple datasets, encompassing both trajectory and 3D pose keypoints, to propose a pre-trained model for human motion prediction, and introduces Multi-Transmotion, an innovative transformer-based model designed for cross-modality pre-training.

Abstract

The ability of intelligent systems to predict human behaviors is crucial, particularly in fields such as autonomous vehicle navigation and social robotics. However, the complexity of human motion has prevented the development of a standardized dataset for human motion prediction, thereby hindering the establishment of pre-trained models. In this paper, we address these limitations by integrating multiple datasets, encompassing both trajectory and 3D pose keypoints, to propose a pre-trained model for human motion prediction. We merge seven distinct datasets across varying modalities and standardize their formats. To facilitate multimodal pre-training, we introduce Multi-Transmotion, an innovative transformer-based model designed for cross-modality pre-training. Additionally, we present a novel masking strategy to capture rich representations. Our methodology demonstrates competitive performance across various datasets on several downstream tasks, including trajectory prediction in the NBA and JTA datasets, as well as pose prediction in the AMASS and 3DPW datasets. The code is publicly available: https://github.com/vita-epfl/multi-transmotion

Paper Structure

This paper contains 18 sections, 3 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Overview. We propose a unified human motion data framework by standardizing the data format and frame settings. Based on that, we introduce a pre-trained transformer model with specialized masking techniques, validating its effectiveness and flexibility across different scenarios.
  • Figure 2: Multi-Transmotion: A transformer-based model that learns cross-modality representations and social interactions. The sampling mask and bi-directional encoders make the model flexible to different frame settings, while the dynamic spatial-temporal mask makes pre-training more efficient and robust.
  • Figure 3: Sampling mask and bi-directional encoder.
  • Figure 4: Qualitative results on the NBA dataset [Nba2016]. The red, blue, and green dots represent the observed historical frames, predicted future frames, and ground truth, respectively. All neighboring players are shown in grey.
  • Figure 5: Qualitative results on dataset. This visualization shows how our model can leverage 3D pose information to augment trajectory prediction. The red trajectory denotes the prediction made without pose knowledge, and the blue trajectory denotes the prediction made with the help of 3D pose. The green trajectory denotes the ground truth.
  • ...and 2 more figures