Multi-scale Temporal Fusion Transformer for Incomplete Vehicle Trajectory Prediction

Zhanwen Liu, Chao Li, Yang Wang, Nan Yang, Xing Fan, Jiaqi Ma, Xiangmo Zhao

TL;DR

This work tackles incomplete vehicle trajectory prediction in real-world traffic by proposing a Transformer-based framework, MTFT, that integrates a Multi-scale Attention Head (MAH) and a Continuity Representation-guided Multi-scale Fusion (CRMF) module. MAH extracts parallel multi-scale motion representations from partially observed histories using scale masks, while CRMF fuses these representations under a continuity-guided signal derived from the observation pattern, producing a robust temporal feature for decoding future paths. The approach is end-to-end and avoids separate imputation steps, demonstrating strong improvements across highway and urban datasets, including significant gains on HighD and competitive performance on Argoverse/IArgoverse, especially as missing data increases. The results highlight MTFT’s ability to mitigate the impact of occlusion and perception failures, enabling accurate predictions aligned with overall motion trends without reliance on HD-map priors, with potential for further gains by incorporating scene priors in future work.

Abstract

Motion prediction plays an essential role in autonomous driving systems, enabling autonomous vehicles to achieve more accurate local-path planning and driving decisions based on predictions of the surrounding vehicles. However, existing methods neglect the missing values caused by object occlusion, perception failures, etc., which inevitably degrades trajectory prediction performance in real traffic scenarios. To address this limitation, we propose a novel end-to-end framework for incomplete vehicle trajectory prediction, named Multi-scale Temporal Fusion Transformer (MTFT), which consists of the Multi-scale Attention Head (MAH) and the Continuity Representation-guided Multi-scale Fusion (CRMF) module. Specifically, the MAH leverages the multi-head attention mechanism to capture, in parallel, multi-scale motion representations of the trajectory at different temporal granularities, thus mitigating the adverse effect of missing values on prediction. The multi-scale motion representations are then fed into the CRMF module, which fuses them to obtain a robust temporal feature of the vehicle. During fusion, the continuity representation of vehicle motion is first extracted across time steps and used to guide the fusion, ensuring that the resulting temporal feature incorporates both detailed information and the overall trend of vehicle motion. This facilitates accurate decoding of a future trajectory that is consistent with the vehicle's motion trend. We evaluate the proposed model on four datasets derived from highway and urban traffic scenarios. The experimental results demonstrate its superior performance in the incomplete vehicle trajectory prediction task compared with state-of-the-art models, e.g., a comprehensive performance improvement of more than 39% on the HighD dataset.

Paper Structure (17 sections, 21 equations, 7 figures, 4 tables)

This paper contains 17 sections, 21 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: (a) shows the distribution of missing percentages across trajectories, revealing that most of the trajectory samples have varying percentages of missing values. In the case shown in (b), vehicle 1 is occluded at time ${t_3}$, while vehicle 2 is occluded at times ${t_2}$ and ${t_4}$, leaving both trajectories incomplete. In contrast, the three cases given in (c) avoid vehicle occlusion by using the BEV perspective. However, the perception algorithm captures the trajectories of only most vehicles (marked by pink boxes); some vehicles (marked by yellow boxes) are not captured because the perception algorithm fails, which again results in incomplete trajectories.
  • Figure 2: Illustration of the proposed MTFT framework. (a) Generate a sequence mask in which the number and positions of the masked steps are randomly distributed; the mask is applied to the complete trajectories provided by the public datasets to obtain incomplete trajectories. (b) Construct the multi-scale attention heads from predefined padding mask matrices with different temporal granularities to extract the multi-scale motion representation. (c) Extract the multi-scale continuity representation across time steps and use it as the query vector for fusing the multi-scale motion representation. (d) Model the global interaction among all vehicles and output the predicted trajectory of the target vehicle.
  • Figure 3: The computation process for attention head 2, where ${\hat{\textbf{m}}^2}$ denotes the temporal scale of this attention head. The three special values 0, 1 and negative infinity in Formula (5) are represented by the gray, white and pink squares in this figure, respectively.
  • Figure 4: The distribution of the percentages of samples with different missing ratios in the IArgoverse dataset.
  • Figure 5: The performance improvement brought by the proposed MAH and CRMF modules on HighD, NGSIM, Argoverse, and IArgoverse datasets.
  • ...and 2 more figures
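
Steps (a) and (b) of the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation and does not reproduce Formula (5), which is not shown here. The function names (`random_sequence_mask`, `scale_attention_bias`, `masked_softmax`), the rule that the last step is always observed, and the choice of a strided grid counted back from the last step as the "temporal scale" are all assumptions made for illustration. The sketch draws a random observation mask, then builds one additive attention bias per scale so that missing or off-grid history steps receive zero attention weight.

```python
import math
import random


def random_sequence_mask(num_steps, max_missing, rng):
    """Return 0/1 flags (1 = observed) with a random number of missing
    steps at random positions. Keeping the last step observed is an
    assumption of this sketch, not something stated in the source."""
    num_missing = rng.randint(0, max_missing)
    candidates = list(range(num_steps - 1))  # never drop the last step
    missing = set(rng.sample(candidates, num_missing))
    return [0 if t in missing else 1 for t in range(num_steps)]


def scale_attention_bias(observed, scale):
    """Additive attention bias for one head at a hypothetical temporal scale.

    Entry [i][j] is 0.0 when key step j is observed and lies on the head's
    grid (every `scale`-th step counted back from the last step), and -inf
    otherwise, so a softmax assigns that key zero weight."""
    num_steps = len(observed)
    last = num_steps - 1
    row = [
        0.0 if observed[j] and (last - j) % scale == 0 else -math.inf
        for j in range(num_steps)
    ]
    return [list(row) for _ in range(num_steps)]


def masked_softmax(scores, bias_row):
    """Softmax over one row of attention scores after adding the bias."""
    shifted = [s + b for s, b in zip(scores, bias_row)]
    peak = max(shifted)  # finite, because the last step is always kept
    exps = [math.exp(v - peak) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: one incomplete history, three heads with different strides.
rng = random.Random(0)
observed = random_sequence_mask(num_steps=8, max_missing=4, rng=rng)
for scale in (1, 2, 4):
    bias = scale_attention_bias(observed, scale)
    weights = masked_softmax([0.0] * 8, bias[0])
    print(scale, [round(w, 2) for w in weights])
```

With uniform scores, each head spreads its attention evenly over the steps it is allowed to see, so coarser scales rely on fewer but more widely spaced observations. In a full model, the per-head outputs would then be fused as described in step (c) of the caption.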