JointMotion: Joint Self-Supervision for Joint Motion Prediction

Royden Wagner, Omer Sahin Tas, Marvin Klemp, Carlos Fernandez

TL;DR

This work presents JointMotion, a self-supervised pre-training method for joint motion prediction in self-driving vehicles that reduces the joint final displacement error of Wayformer, HPTR, and Scene Transformer models, and enables transfer learning between the Waymo Open Motion and the Argoverse 2 Motion Forecasting datasets.
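
The metric named above can be illustrated with a short sketch. It assumes the common joint-prediction convention in which each of K predicted modes contains one trajectory per agent, the final displacement is averaged over agents within a mode, and the best mode is reported. The source does not spell out this definition, and the function name `joint_min_fde` and the toy endpoints are hypothetical.

```python
import math


def joint_min_fde(pred_endpoints, gt_endpoints):
    """Joint minimum final displacement error (assumed definition).

    pred_endpoints[k][a] is the (x, y) final position of agent a in joint mode k.
    gt_endpoints[a] is the ground-truth (x, y) final position of agent a.
    """
    best = math.inf
    for mode in pred_endpoints:
        # Average the final displacement over all agents of this joint mode,
        # so a mode is only good if it is good for every agent at once.
        fde = sum(math.dist(p, g) for p, g in zip(mode, gt_endpoints)) / len(gt_endpoints)
        best = min(best, fde)
    return best


# Toy scene with two agents and two joint modes (illustrative values only).
preds = [
    [(3.0, 4.0), (0.0, 0.0)],  # mode 0
    [(0.0, 1.0), (1.0, 0.0)],  # mode 1
]
gt = [(0.0, 0.0), (1.0, 1.0)]
score = joint_min_fde(preds, gt)
```

Unlike a marginal metric, this does not allow picking the best mode separately for each agent, which is why it is the stricter measure for scene-consistent forecasts.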

Abstract

We present JointMotion, a self-supervised pre-training method for joint motion prediction in self-driving vehicles. Our method jointly optimizes a scene-level objective connecting motion and environments, and an instance-level objective to refine learned representations. Scene-level representations are learned via non-contrastive similarity learning of past motion sequences and environment context. At the instance level, we use masked autoencoding to refine multimodal polyline representations. We complement this with an adaptive pre-training decoder that enables JointMotion to generalize across different environment representations, fusion mechanisms, and dataset characteristics. Notably, our method reduces the joint final displacement error of Wayformer, HPTR, and Scene Transformer models by 3%, 8%, and 12%, respectively, and enables transfer learning between the Waymo Open Motion and the Argoverse 2 Motion Forecasting datasets. Code: https://github.com/kit-mrt/future-motion
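
To make the scene-level objective more concrete, the following is a minimal sketch of one well-known non-contrastive similarity loss, a Barlow Twins-style redundancy reduction between motion and environment embeddings. The abstract only states that the objective is non-contrastive, so the specific loss form, the function names, and the `off_diag_weight` value are assumptions for illustration rather than the authors' implementation.

```python
import math


def standardize(batch, eps=1e-9):
    """Standardize each feature dimension to zero mean and unit variance over the batch."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in batch) / n) + eps
        for j in range(d)
    ]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in batch]


def redundancy_reduction_loss(motion_emb, env_emb, off_diag_weight=0.005):
    """Push the cross-correlation matrix of the two embeddings toward identity.

    No negative pairs are needed: diagonal terms align the two views,
    off-diagonal terms decorrelate feature dimensions.
    """
    zm, ze = standardize(motion_emb), standardize(env_emb)
    n, d = len(zm), len(zm[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c = sum(zm[b][i] * ze[b][j] for b in range(n)) / n
            loss += (1.0 - c) ** 2 if i == j else off_diag_weight * c ** 2
    return loss


# Toy batch of four scenes with 2-D embeddings (illustrative values only).
motion = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
env_swapped = [[row[1], row[0]] for row in motion]  # feature dimensions swapped

aligned = redundancy_reduction_loss(motion, motion)
misaligned = redundancy_reduction_loss(motion, env_swapped)
```

In this toy case the aligned pair yields a loss close to zero, while swapping the feature dimensions of one view is penalized on both the diagonal and the off-diagonal terms.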

Paper Structure (10 sections, 1 equation, 4 figures, 3 tables)

Figures (4)

  • Figure 1: JointMotion. (a) Connecting motion and environments: Our scene-level objective learns joint scene representations via non-contrastive similarity learning of motion sequences $M$ and environment context $E$. (b) Masked polyline modeling: Our instance-level objective refines learned representations via masked autoencoding of multimodal polyline embeddings (i.e., motion, lane, and traffic light data).
  • Figure 2: Adaptive decoding for masked polyline modeling with late and early fusion encoders. (a) Late fusion with modality-specific encoders for agents (Encoder$^\text{A}$), lanes (Encoder$^\text{L}$), and traffic lights (Encoder$^{\text{TL}}$). (b) Early fusion with a shared encoder for all modalities. Compressed features are decoded using learned query tokens.
  • Figure 3: Loss plots of our complementary pre-training objectives. The green curve represents JointMotion w/o CME, while the blue curve represents JointMotion. Consistent with the remainder of the document, $\text{L}$ stands for lanes, $\text{TL}$ stands for traffic lights, and $\text{A}$ stands for agents.
  • Figure 4: Accelerating and improving training via SSL. Scene Transformer models pre-trained with JointMotion achieve higher mAP scores on WOMD than models trained from scratch.
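
The masked polyline modeling described in the captions of Figures 1 and 2 can be sketched in a few lines. This is a simplified illustration under stated assumptions: polyline embeddings are plain lists of floats, the mask token is a fixed zero vector (in practice it would typically be learned), the mask ratio of 0.6 is arbitrary, and the encoder and decoder are omitted. All names are hypothetical.

```python
import random


def mask_polylines(tokens, mask_ratio, mask_token, rng):
    """Replace a random subset of polyline embeddings with a mask token."""
    n_masked = max(1, round(mask_ratio * len(tokens)))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_masked))
    masked_set = set(masked_idx)
    corrupted = [
        list(mask_token) if i in masked_set else list(tok)
        for i, tok in enumerate(tokens)
    ]
    return corrupted, masked_idx


def masked_reconstruction_loss(reconstructed, targets, masked_idx):
    """Mean squared error computed over the masked positions only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for r, t in zip(reconstructed[i], targets[i]):
            total += (r - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
# Ten toy polyline embeddings; in a real model these would mix agent,
# lane, and traffic light polylines.
tokens = [[float(i), float(i) + 0.5] for i in range(10)]
corrupted, masked_idx = mask_polylines(tokens, 0.6, [0.0, 0.0], rng)

# A perfect decoder would reproduce the original embeddings exactly;
# returning the corrupted input unchanged leaves a nonzero error.
perfect = masked_reconstruction_loss(tokens, tokens, masked_idx)
untouched = masked_reconstruction_loss(corrupted, tokens, masked_idx)
```

Restricting the loss to masked positions means the model gets no credit for copying visible polylines and has to infer the hidden ones from context.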