T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory

Daehee Park, Jaeseok Jeong, Sung-Hoon Yoon, Jaewoo Jeong, Kuk-Jin Yoon

TL;DR

The paper addresses unreliable trajectory prediction under test-time distribution shifts by introducing T4P, a test-time training framework that combines a masked autoencoder (MAE) for deep representation learning with an actor-specific token memory for instance-level adaptation. Offline training optimizes reconstruction and regression losses on source data using a ForecastMAE backbone, while test-time training updates deeper layers on target data using delayed ground-truth and a reconstruction objective to preserve representations. Actor-specific tokens evolve across scenes to capture per-actor motion patterns, enabling robust online adaptation and improved multi-modal predictions. Across nuScenes, Lyft, Waymo, and INTERACTION, T4P achieves state-of-the-art accuracy and efficiency under distribution shifts, demonstrating practical real-time applicability for autonomous driving.

Abstract

Trajectory prediction is a challenging problem that requires considering interactions among multiple actors and the surrounding environment. While data-driven approaches have been used to address this complex problem, they suffer from unreliable predictions under distribution shifts at test time. Accordingly, several online learning methods have been proposed that use a regression loss computed from the ground truth of observed data, leveraging the auto-labeling nature of the trajectory prediction task. We mainly tackle the following two issues. First, previous works are prone to both underfitting and overfitting because they optimize only the last layer of the motion decoder. To this end, we employ a masked autoencoder (MAE) for representation learning, which encourages complex interaction modeling under the shifted test distribution and allows deeper layers to be updated. Second, utilizing the sequential nature of driving data, we propose an actor-specific token memory that enables test-time learning of actor-wise motion characteristics. Our proposed method has been validated across various challenging cross-dataset distribution shift scenarios involving nuScenes, Lyft, Waymo, and INTERACTION. Our method surpasses existing state-of-the-art online learning methods in both prediction accuracy and computational efficiency. The code is available at https://github.com/daeheepark/T4P.
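
The "auto-labeling nature" mentioned in the abstract can be illustrated with a minimal sketch: once enough frames have streamed in, the most recent frames serve as ground truth for a prediction that would have been made from an earlier (delayed) time step. The function name `delayed_samples` and the parameters `hist_len` and `fut_len` are hypothetical and not taken from the paper or its repository; real inputs would be per-actor states and map features rather than plain integers.

```python
from collections import deque


def delayed_samples(frames, hist_len, fut_len):
    """Yield (t, history, future) training pairs from a stream of frames.

    At current time t, the sliding window holds hist_len + fut_len frames.
    The first hist_len frames are the input seen at the delayed time
    t - fut_len; the last fut_len frames are its now-observed ground truth.
    """
    window = deque(maxlen=hist_len + fut_len)
    for t, frame in enumerate(frames):
        window.append(frame)
        if len(window) == hist_len + fut_len:
            buf = list(window)
            yield t, buf[:hist_len], buf[hist_len:]


if __name__ == "__main__":
    # Integers stand in for per-frame observations (an assumption for brevity).
    for t, hist, fut in delayed_samples(range(6), hist_len=2, fut_len=3):
        print(t, hist, fut)
```

Each yielded pair could feed a regression loss during test-time training; no labels beyond the observed stream are needed.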

Paper Structure (24 sections, 7 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Previous methods optimize the last layer of the decoder using regression loss from delayed ground truth. Our method, on the other hand, learns representation via a masked autoencoder, which boosts prediction performance by optimizing deeper layers. In addition, the proposed actor-specific token enables the prediction model to learn actor-wise motion characteristics.
  • Figure 2: Overall method. During test-time training, the network trained on the source dataset is optimized on target data in an online setting. The model is optimized with both regression and reconstruction losses. Both losses use the data observed at the delayed time stamp ($t_\tau$). The actor-specific token is used to learn instance-wise motion patterns during the test-time training phase. During the online evaluation phase, the model and the actor-specific tokens learned during test-time training are used.
  • Figure 3: The actor-specific token memory is colored in gray. It evolves as time passes within a scene. For newborn actors, the corresponding class token is registered. Until the actor disappears, the token is updated through test-time training. At the end of the scene, all tokens are averaged per class and passed to the next scene, as denoted by the red arrow and Eq. \ref{eq:scene_change}.
  • Figure 4: The first row shows predictions before adaptation, and the second row shows adaptation results for three methods: ours (blue), TENT w/ sup (orange), and MEK (green). Sky-blue and orange boxes denote surrounding actors and actors to be predicted, respectively. For visual simplicity, we show the result for only one actor and, among the multi-modal predictions, only the mode closest to the GT. Note that our method produces multi-modal predictions for all actors.
  • Figure 5: The first row indicates masked samples, and the row below shows the reconstructed outputs. The blue/red arrows indicate historical/future trajectories. The black arrows refer to the masked trajectories. The white lines are the lane centerlines, and the gray dashed lines are the masked lane centerlines.
  • ...and 3 more figures
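
The token lifecycle described in the Figure 3 caption (register a class token for a new actor, update it during the scene, average per class at the end of the scene) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `ActorTokenMemory` and its methods are hypothetical, tokens are plain Python lists rather than learned embeddings, the update is a plain gradient step standing in for test-time training, and the scene-change rule is assumed to be a simple per-class mean (the exact form of the referenced equation is not given here).

```python
class ActorTokenMemory:
    """Toy actor-specific token memory (hypothetical sketch)."""

    def __init__(self, class_tokens):
        # class_tokens: class name -> initial token (list of floats)
        self.class_tokens = {c: list(v) for c, v in class_tokens.items()}
        self.actor_tokens = {}  # actor id -> (class name, token)

    def get(self, actor_id, cls):
        # A newborn actor is registered with a copy of its class token.
        if actor_id not in self.actor_tokens:
            self.actor_tokens[actor_id] = (cls, list(self.class_tokens[cls]))
        return self.actor_tokens[actor_id][1]

    def update(self, actor_id, grad, lr=0.1):
        # Stand-in for a test-time training step on one actor's token.
        cls, tok = self.actor_tokens[actor_id]
        new_tok = [w - lr * g for w, g in zip(tok, grad)]
        self.actor_tokens[actor_id] = (cls, new_tok)

    def end_scene(self):
        # Assumed scene-change rule: average this scene's tokens per class
        # and carry the means into the next scene as the new class tokens.
        by_class = {}
        for cls, tok in self.actor_tokens.values():
            by_class.setdefault(cls, []).append(tok)
        for cls, toks in by_class.items():
            self.class_tokens[cls] = [sum(col) / len(toks) for col in zip(*toks)]
        self.actor_tokens.clear()


if __name__ == "__main__":
    mem = ActorTokenMemory({"car": [0.0, 0.0], "ped": [1.0, 1.0]})
    mem.get("a", "car")
    mem.get("b", "car")
    mem.update("a", [1.0, -1.0], lr=0.5)
    mem.end_scene()
    print(mem.class_tokens)
```

Classes with no actors in a scene keep their previous token in this sketch. Actors that leave mid-scene are simply kept until the scene ends, which is a simplification.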