SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction

Yang Zhou, Hao Shao, Letian Wang, Steven L. Waslander, Hongsheng Li, Yu Liu

TL;DR

SmartPretrain tackles the limited data problem in motion prediction for autonomous driving by introducing a model- and dataset-agnostic SSL framework that blends trajectory contrastive learning (TCL) and trajectory reconstruction learning (TRL) within a dataset-agnostic sampling pipeline. It standardizes inputs across multiple driving datasets and uses two sub-scenarios to create robust positive and negative pairs, enabling cross-dataset generalization. Empirically, it improves state-of-the-art predictors across Argoverse, Argoverse 2, and WOMD, with data-scaled pre-training delivering the largest gains and outperforming existing pre-training approaches. This approach offers a scalable path to robust, transferable motion representations in the small-data regime and is released with open-source code for broad adoption.

Abstract

Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed environments. However, the scarcity of large-scale driving datasets has hindered the development of robust and generalizable motion prediction models, limiting their ability to capture complex interactions and road geometries. Inspired by recent advances in natural language processing (NLP) and computer vision (CV), self-supervised learning (SSL) has gained significant attention in the motion prediction community for learning rich and transferable scene representations. Nonetheless, existing pre-training methods for motion prediction have largely focused on specific model architectures and a single dataset, limiting their scalability and generalizability. To address these challenges, we propose SmartPretrain, a general and scalable SSL framework for motion prediction that is both model-agnostic and dataset-agnostic. Our approach integrates contrastive and reconstructive SSL, leveraging the strengths of both generative and discriminative paradigms to effectively represent spatiotemporal evolution and interactions without imposing architectural constraints. Additionally, SmartPretrain employs a dataset-agnostic scenario sampling strategy that integrates multiple datasets, enhancing data volume, diversity, and robustness. Extensive experiments on multiple datasets demonstrate that SmartPretrain consistently improves the performance of state-of-the-art prediction models across datasets, data splits, and main metrics. For instance, SmartPretrain reduces the MissRate of Forecast-MAE by 10.6%. These results highlight SmartPretrain's effectiveness as a unified, scalable solution for motion prediction, breaking free from the limitations of the small-data regime. Code is available at https://github.com/youngzhou1999/SmartPretrain
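
The pairing of a contrastive objective with a reconstructive one, as described in the abstract, can be illustrated with a small sketch. This is not the authors' implementation: the InfoNCE form of the contrastive loss, the cosine similarity, the temperature value, the mean-displacement reconstruction error, and every function name here are assumptions chosen for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def info_nce(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss (assumed form, hypothetical temperature).

    emb_a[i] and emb_b[i] embed the same agent as seen in two
    sub-scenarios (a positive pair); every other pairing (i, j != i)
    is treated as a negative.
    """
    n = len(emb_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(emb_a[i], emb_b[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / n

def reconstruction_loss(pred, target):
    """Mean Euclidean distance between reconstructed and true waypoints."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(target)

# Two agents with 2-D embeddings from each branch (toy values).
branch_a = [[1.0, 0.0], [0.0, 1.0]]
branch_b = [[1.0, 0.0], [0.0, 1.0]]
contrastive = info_nce(branch_a, branch_b)
reconstructive = reconstruction_loss([(0.0, 0.0)], [(3.0, 4.0)])
total_loss = contrastive + 1.0 * reconstructive  # the weight 1.0 is hypothetical
```

Neither loss depends on how the embeddings were produced, which is the sense in which such objectives can be attached to different prediction backbones.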

Paper Structure

This paper contains 19 sections, 2 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Illustration comparing existing trajectory prediction pre-training pipelines with ours. Our pipeline can unlock performance gains from all models as it is model-agnostic, while most existing pipelines are model-specific and inflexible.
  • Figure 2: Overview of our model-agnostic and dataset-agnostic pre-training pipeline. We begin by randomly sampling a training scenario from mixed datasets. From this scenario, two sub-scenarios with different temporal timelines are randomly sampled and fed into two model branches to generate trajectory embeddings for the scene. Two model-agnostic pretext tasks, trajectory contrastive learning and trajectory reconstruction learning, are introduced to learn transferable and robust representations.
  • Figure 3: Ablation study on pre-training epochs and batch sizes. More pre-training epochs and larger batch sizes enhance performance, while diminishing returns are observed beyond certain levels.
  • Figure 4: Visualization results of trajectory alignment. The blue arrows are the model's multi-modal trajectory predictions for the target agent, and the pink arrow is the ground truth future trajectory. After pre-training, the predicted trajectories get closer to the ground truth.
  • Figure 5: Visualization results of long trajectories. The blue arrows are the model's multi-modal trajectory predictions for the target agent, and the pink arrow is the ground truth future trajectory. Our pre-training method enables more accurate trajectory predictions in the long term.
  • ...and 3 more figures
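
The sampling step described in the Figure 2 caption (draw one scenario from the mixed datasets, then draw two sub-scenarios on different timelines) can be sketched as below. This is a hedged illustration only: the dictionary layout of `datasets`, the uniform sampling over the pooled scenarios, the fixed window length, and all function names are assumptions, not details taken from the paper.

```python
import random

def sample_scenario(datasets, rng):
    """Pick one (dataset_name, scenario) pair uniformly from the pooled data."""
    pool = [(name, scenario)
            for name, scenarios in datasets.items()
            for scenario in scenarios]
    return rng.choice(pool)

def sample_two_windows(num_steps, window, rng):
    """Return two distinct start indices for length-`window` sub-scenarios."""
    max_start = num_steps - window
    if max_start < 1:
        raise ValueError("scenario too short for two distinct windows")
    start_a, start_b = rng.sample(range(max_start + 1), 2)
    return start_a, start_b

def crop(trajectory, start, window):
    """Cut a sub-scenario of `window` steps out of a full trajectory."""
    return trajectory[start:start + window]

# Toy data: each scenario is a list of timesteps (hypothetical format).
rng = random.Random(0)
datasets = {
    "dataset_a": [list(range(50))],
    "dataset_b": [list(range(110)), list(range(91))],
}
name, scenario = sample_scenario(datasets, rng)
start_a, start_b = sample_two_windows(len(scenario), window=30, rng=rng)
view_a = crop(scenario, start_a, 30)  # input to the first model branch
view_b = crop(scenario, start_b, 30)  # input to the second model branch
```

Because the two windows come from the same scenario, an agent that appears in both gives a natural positive pair, while other agents can serve as negatives.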