T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory
Daehee Park, Jaeseok Jeong, Sung-Hoon Yoon, Jaewoo Jeong, Kuk-Jin Yoon
TL;DR
The paper addresses unreliable trajectory prediction under test-time distribution shifts by introducing T4P, a test-time training framework that combines a masked autoencoder (MAE) for deep representation learning with an actor-specific token memory for instance-level adaptation. Offline training optimizes reconstruction and regression losses on source data using a ForecastMAE backbone. Test-time training then updates deeper layers on target data, using a regression loss on delayed ground truth together with a reconstruction objective that preserves the learned representations. Actor-specific tokens evolve across scenes to capture per-actor motion patterns, enabling robust online adaptation and improved multi-modal predictions. In cross-dataset experiments on nuScenes, Lyft, Waymo, and INTERACTION, T4P achieves state-of-the-art accuracy and efficiency under distribution shifts, demonstrating practical real-time applicability for autonomous driving.
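The delayed ground truth mentioned above follows from the auto-labeling nature of the task: a prediction made at one time step can be scored once the actor's later positions have been observed. The following is a minimal sketch of that bookkeeping only, not the authors' implementation. The class `DelayedLabelBuffer`, the function `average_displacement_error`, and the `horizon` parameter are hypothetical names introduced for illustration.

```python
from collections import deque


class DelayedLabelBuffer:
    """Pairs each past prediction with ground truth that becomes observable later.

    Hypothetical helper for illustration; not taken from the T4P code base.
    """

    def __init__(self, horizon):
        self.horizon = horizon    # number of future steps each prediction covers
        self.positions = []       # observed (x, y) position per time step
        self.pending = deque()    # (time step of prediction, predicted future)

    def step(self, position, prediction):
        """Record the newest observation and prediction; return matured pairs."""
        self.positions.append(position)
        t_now = len(self.positions) - 1
        self.pending.append((t_now, prediction))
        matured = []
        # A prediction made at time t covers steps t+1 .. t+horizon, so it is
        # fully labeled once t + horizon <= t_now.
        while self.pending and self.pending[0][0] + self.horizon <= t_now:
            t_pred, pred = self.pending.popleft()
            truth = self.positions[t_pred + 1 : t_pred + 1 + self.horizon]
            matured.append((pred, truth))
        return matured


def average_displacement_error(pred, truth):
    """Mean Euclidean distance between predicted and observed positions."""
    dists = [((px - tx) ** 2 + (py - ty) ** 2) ** 0.5
             for (px, py), (tx, ty) in zip(pred, truth)]
    return sum(dists) / len(dists)
```

Each matured `(prediction, ground truth)` pair could feed a regression loss during test-time training. In this sketch the first pair becomes available only after `horizon` further observations have arrived.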
Abstract
Trajectory prediction is a challenging problem that requires considering interactions among multiple actors and the surrounding environment. While data-driven approaches have been used to address this complex problem, they suffer from unreliable predictions under distribution shifts during test time. Accordingly, several online learning methods have been proposed that apply a regression loss to the ground truth of observed data, exploiting the auto-labeling nature of the trajectory prediction task. We mainly tackle the following two issues. First, previous works both underfit and overfit because they optimize only the last layer of the motion decoder. To address this, we employ a masked autoencoder (MAE) for representation learning, which encourages complex interaction modeling under the shifted test distribution and allows deeper layers to be updated. Second, utilizing the sequential nature of driving data, we propose an actor-specific token memory that enables test-time learning of actor-wise motion characteristics. Our proposed method has been validated across various challenging cross-dataset distribution shift scenarios including nuScenes, Lyft, Waymo, and INTERACTION. Our method outperforms existing state-of-the-art online learning methods in terms of both prediction accuracy and computational efficiency. The code is available at https://github.com/daeheepark/T4P.
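The idea of an actor-specific token memory can be illustrated with a small sketch: one learnable vector is kept per actor identity and carried across consecutive scenes, so that adaptation accumulates for actors that are seen again. This is a simplified illustration under stated assumptions, not the authors' implementation. The class `ActorTokenMemory`, the zero initialization, and the plain gradient-step update are all hypothetical choices made here.

```python
class ActorTokenMemory:
    """Keeps one token (a list of floats) per actor id across scenes.

    Hypothetical sketch; initialization and update rule are assumptions.
    """

    def __init__(self, dim, init_value=0.0):
        self.dim = dim
        self.init_value = init_value
        self.tokens = {}  # actor id -> token vector

    def get(self, actor_id):
        """Return the actor's token, creating a default one for a new actor."""
        if actor_id not in self.tokens:
            self.tokens[actor_id] = [self.init_value] * self.dim
        return self.tokens[actor_id]

    def update(self, actor_id, grad, lr=0.1):
        """Apply one gradient-descent step to the actor's token (assumed rule)."""
        token = self.get(actor_id)
        self.tokens[actor_id] = [w - lr * g for w, g in zip(token, grad)]


# Usage: an actor seen in two consecutive scenes keeps its adapted token,
# while a newly appearing actor starts from the default token.
memory = ActorTokenMemory(dim=2)
memory.update("car_7", grad=[1.0, -2.0], lr=0.5)  # after scene 1
token_scene_2 = memory.get("car_7")               # reused in scene 2
token_new_actor = memory.get("car_9")             # first appearance
```

Keying the memory by actor id is what makes the adaptation instance-level: gradients from one actor's delayed ground truth change only that actor's token.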
