CDKFormer: Contextual Deviation Knowledge-Based Transformer for Long-Tail Trajectory Prediction

Yuansheng Lian, Ke Zhang, Meng Li

TL;DR

CDKFormer tackles the rare and challenging long-tail trajectory prediction problem for autonomous vehicles by introducing contextual deviation features and a dual query-based Transformer decoder. It jointly encodes scene context and deviation status, then decodes with mode and dual future queries through a multistream decoder to generate robust multimodal trajectories. The method achieves state-of-the-art results on Argoverse 2 and inD, with strong tail performance demonstrated via CVaR analysis and comprehensive ablations. It also emphasizes the need for future work on map-aware deviation modeling and causal analysis of tail failures to further improve safety and reliability in real-world traffic.
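For readers unfamiliar with CVaR (conditional value at risk) as a tail metric, the following is a minimal sketch of the standard definition: the mean of the worst fraction of per-sample errors. The function name `cvar`, the `tail_fraction` parameter, and the example error values are illustrative assumptions; the tail level and error metric used in the paper are not stated in this summary.

```python
def cvar(errors, tail_fraction=0.05):
    """Mean of the worst `tail_fraction` of errors (larger error = worse).

    `tail_fraction` is a hypothetical parameter; the paper's level is not given here.
    """
    if not errors:
        raise ValueError("errors must be non-empty")
    if not 0.0 < tail_fraction <= 1.0:
        raise ValueError("tail_fraction must be in (0, 1]")
    ranked = sorted(errors, reverse=True)           # worst errors first
    k = max(1, round(len(ranked) * tail_fraction))  # size of the tail, at least one sample
    worst = ranked[:k]
    return sum(worst) / len(worst)


# Made-up per-sample displacement errors (metres), for illustration only.
errors = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 4.0, 6.0]
mean_error = sum(errors) / len(errors)
tail_error = cvar(errors, tail_fraction=0.2)  # mean of the two worst samples
```

In this toy example the two large errors barely move the overall mean but dominate the CVaR, which is why a tail-focused metric can separate models that look similar on average.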

Abstract

Predicting the future movements of surrounding vehicles is essential for ensuring the safe operation and efficient navigation of autonomous vehicles (AVs) in urban traffic environments. Existing vehicle trajectory prediction methods primarily focus on improving overall performance, yet they struggle to address long-tail scenarios effectively. This limitation often leads to poor predictions in rare cases, significantly increasing the risk of safety incidents. Taking the Argoverse 2 motion forecasting dataset as an example, we first investigate the long-tail characteristics in trajectory samples from two perspectives, individual motion and group interaction, and derive deviation features to distinguish abnormal from regular scenarios. On this basis, we propose CDKFormer, a Contextual Deviation Knowledge-based Transformer model for long-tail trajectory prediction. CDKFormer integrates an attention-based scene context fusion module to encode spatiotemporal interaction and road topology. An additional deviation feature fusion module is proposed to capture the dynamic deviations in the target vehicle status. We further introduce a dual query-based decoder, supported by a multi-stream decoder block, to sequentially decode heterogeneous scene deviation features and generate multimodal trajectory predictions. Extensive experiments demonstrate that CDKFormer achieves state-of-the-art performance, significantly enhancing prediction accuracy and robustness for long-tailed trajectories compared to existing methods, thus advancing the reliability of AVs in complex real-world environments.

Paper Structure

This paper contains 35 sections, 15 equations, 13 figures, 9 tables, 1 algorithm.

Figures (13)

  • Figure 1: Rarity score distribution. (a) Spatial rarity score distribution. The trajectory endpoints are fitted to a GMM. The score is the negative log-likelihood of an endpoint under this distribution. (b) Temporal rarity score distribution. Calculated from a GMM fitted on the low-dimensional FPCA scores of the full trajectories. (c) Final rarity score distribution. The final score is the square root of the product of the spatial and temporal rarity scores. All scores are normalized to [0, 1], with higher scores indicating higher rarity. Both GMMs have 10 components, a number selected by minimizing the Bayesian information criterion.
  • Figure 2: Tail score distribution of the training samples in the Argoverse 2 motion forecasting dataset. The tail score is calculated as the product of the difficulty score and the rarity score. The tail score is shown in log scale.
  • Figure 3: Distribution of speed difference, speed standard deviation, heading difference and heading standard deviation for the top 10% head and top 10% tail samples. The y-axis (density) is in log scale.
  • Figure 4: Distribution of relative speed and heading of top 10% head and top 10% tail samples.
  • Figure 5: Overview of the proposed CDKFormer architecture. The model first encodes the agent motion and scene contextual information with self-attention-based encoders. The deviation and motion features of the target vehicle are jointly fused in a deviation fusion module. The scene context and deviation information are subsequently decoded by a mode query and dual future queries, including a regular future query and a tail future query, within multistream decoder blocks. Then, a scene query is obtained by combining the mode query and the weighted combined future query. This scene query is further refined and used for multimodal trajectory generation. $\times N$ denotes $N$ stacked layers. (R)Future and (T)Future denote regular future query and tail future query, respectively.
  • ...and 8 more figures
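
The score combination described in the captions of Figures 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the GMM negative log-likelihoods are already computed, that min-max scaling is the normalization used, and that each score is normalized before being combined. The function names and input values are hypothetical.

```python
import math


def min_max_normalize(values):
    """Scale values to [0, 1]; min-max scaling is an assumption, not confirmed by the source."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate case: all samples equally rare
    return [(v - lo) / (hi - lo) for v in values]


def rarity_scores(spatial_nll, temporal_nll):
    """Final rarity: square root of the product of the spatial and temporal scores (Figure 1c)."""
    spatial = min_max_normalize(spatial_nll)
    temporal = min_max_normalize(temporal_nll)
    return [math.sqrt(s * t) for s, t in zip(spatial, temporal)]


def tail_scores(difficulty, rarity):
    """Tail score: product of the difficulty score and the rarity score (Figure 2)."""
    return [d * r for d, r in zip(difficulty, rarity)]


# Made-up negative log-likelihoods and difficulty scores for three samples.
spatial_nll = [1.0, 2.0, 5.0]    # from a GMM fitted on trajectory endpoints
temporal_nll = [0.5, 2.5, 4.5]   # from a GMM fitted on FPCA scores
difficulty = [0.2, 0.5, 1.0]

rarity = rarity_scores(spatial_nll, temporal_nll)
tail = tail_scores(difficulty, rarity)
```

Because the two rarity scores are multiplied, a sample only receives a high final rarity when it is unusual both spatially and temporally, and a high tail score additionally requires that it is hard to predict.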