Multi-scale Temporal Fusion Transformer for Incomplete Vehicle Trajectory Prediction
Zhanwen Liu, Chao Li, Yang Wang, Nan Yang, Xing Fan, Jiaqi Ma, Xiangmo Zhao
TL;DR
This work tackles incomplete vehicle trajectory prediction in real-world traffic with MTFT, a Transformer-based framework that combines a Multi-scale Attention Head (MAH) with a Continuity Representation-guided Multi-scale Fusion (CRMF) module. MAH uses scale masks to extract multi-scale motion representations in parallel from partially observed histories. CRMF then fuses these representations, guided by a continuity signal derived from the observation pattern, into a robust temporal feature from which future paths are decoded. The approach is end-to-end and needs no separate imputation step. It improves on prior models across highway and urban datasets, with significant gains on HighD and competitive performance on Argoverse/IArgoverse, especially as the amount of missing data increases. The results indicate that MTFT mitigates the impact of occlusion and perception failures and produces predictions aligned with the overall motion trend without relying on HD-map priors; incorporating scene priors is left to future work as a source of further gains.
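To make the idea of scale masks concrete, the following is a minimal sketch of masked self-attention over a partially observed history, run once per temporal scale. The mask rule used here (a step may attend only to observed steps whose time offset is a multiple of the scale), the function names `scale_mask` and `masked_attention`, and the toy data are all assumptions for illustration; the summary above does not specify how MTFT defines its masks, and no learned projections are included.

```python
import math

def scale_mask(observed, scale):
    """Hypothetical scale mask: step i may attend to step j only if
    j was observed and the offset (i - j) is a multiple of `scale`."""
    T = len(observed)
    return [[bool(observed[j]) and (i - j) % scale == 0 for j in range(T)]
            for i in range(T)]

def masked_attention(x, mask):
    """Scaled dot-product self-attention restricted to allowed positions.
    Queries, keys and values are all `x` (no learned projections)."""
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        allowed = [j for j, ok in enumerate(mask[i]) if ok]
        if not allowed:
            # No visible step at this scale: fall back to a zero vector.
            out.append([0.0] * d)
            continue
        scores = [sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d)
                  for j in allowed]
        top = max(scores)                      # subtract max for stability
        w = [math.exp(s - top) for s in scores]
        z = sum(w)
        out.append([sum(wk / z * x[j][c] for wk, j in zip(w, allowed))
                    for c in range(d)])
    return out

# Toy history: six time steps of 2-D features; steps 1 and 4 are missing.
observed = [1, 0, 1, 1, 0, 1]
history = [[0.0, 1.0], [0.0, 0.0], [0.2, 1.1],
           [0.3, 1.0], [0.0, 0.0], [0.5, 1.2]]

# One attention output per temporal scale, computed independently.
heads = {s: masked_attention(history, scale_mask(observed, s))
         for s in (1, 2, 3)}
```

Each entry of `heads` is a sequence of per-step features at one temporal scale. Coarser scales skip over nearby gaps, which is one plausible reading of why multi-scale attention is less sensitive to missing observations.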
Abstract
Motion prediction plays an essential role in autonomous driving systems, enabling autonomous vehicles to make more accurate local-path planning and driving decisions based on predictions of surrounding vehicles. However, existing methods neglect the missing values that can arise from object occlusion, perception failures, and similar causes, which inevitably degrades trajectory prediction performance in real traffic scenarios. To address this limitation, we propose a novel end-to-end framework for incomplete vehicle trajectory prediction, named Multi-scale Temporal Fusion Transformer (MTFT), which consists of the Multi-scale Attention Head (MAH) and the Continuity Representation-guided Multi-scale Fusion (CRMF) module. Specifically, the MAH leverages the multi-head attention mechanism to capture multi-scale motion representations of the trajectory in parallel at different temporal granularities, thus mitigating the adverse effect of missing values on prediction. The multi-scale motion representations are then fed into the CRMF module, which fuses them into a robust temporal feature of the vehicle. During fusion, a continuity representation of vehicle motion is first extracted across time steps to guide the process, ensuring that the resulting temporal feature captures both fine-grained detail and the overall trend of vehicle motion. This facilitates accurate decoding of a future trajectory that is consistent with the vehicle's motion trend. We evaluate the proposed model on four datasets derived from highway and urban traffic scenarios. The experimental results demonstrate its superior performance on the incomplete vehicle trajectory prediction task compared with state-of-the-art models, e.g., a comprehensive performance improvement of more than 39% on the HighD dataset.
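The fusion step can be illustrated with a small sketch in which a continuity score computed from the observation pattern sets the weight of each scale. This is an assumption-laden stand-in, not the CRMF module itself: the abstract does not define the continuity representation, so the score used here (the fraction of step pairs at a given offset that were both observed), the softmax weighting, and the names `continuity_score` and `fuse` are all hypothetical.

```python
import math

def continuity_score(observed, scale):
    """Hypothetical continuity measure: fraction of step pairs
    (t, t + scale) in which both steps were observed."""
    pairs = [(observed[t], observed[t + scale])
             for t in range(len(observed) - scale)]
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a and b) / len(pairs)

def fuse(features_by_scale, observed):
    """Weight each scale's feature vector by a softmax over its
    continuity score and return the weighted sum plus the weights."""
    scales = sorted(features_by_scale)
    scores = [continuity_score(observed, s) for s in scales]
    top = max(scores)                          # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(features_by_scale[scales[0]])
    fused = [sum(w * features_by_scale[s][c] for w, s in zip(weights, scales))
             for c in range(dim)]
    return fused, dict(zip(scales, weights))

# Every other step is missing, so adjacent steps are never both observed,
# while steps two apart often are.
observed = [1, 0, 1, 0, 1, 0, 1]
features_by_scale = {1: [1.0, 0.0], 2: [0.0, 1.0]}  # toy per-scale features

fused, weights = fuse(features_by_scale, observed)
```

With this observation pattern the scale-2 feature receives the larger weight, because the fine scale has no continuous pairs to rely on. A learned module would replace the hand-written score, but the role of the continuity signal as a guide for weighting scales is the same.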
