Trajectory Prediction for Autonomous Driving Using a Transformer Network

Zhenning Li, Hao Yu

TL;DR

The paper tackles predicting future motions of surrounding agents for autonomous driving by fusing scene context with historical trajectories through a Context-Aware Transformer (CATF). It extends the model with a lighter CATF_l variant that uses linear attention for efficiency, and introduces a multimodal prediction framework with an off-road loss to enforce feasibility. Empirically, CATF and CATF_l achieve state-of-the-art performance on Lyft l5kit across multiple metrics, with CATF_l offering substantially faster inference and lower memory use. The work improves prediction plausibility and safety by aligning forecasts with drivable regions while maintaining high accuracy, enabling more reliable autonomous driving decisions.

Abstract

Predicting the trajectories of surrounding agents is still considered one of the most challenging tasks for autonomous driving. In this paper, we introduce a multi-modal trajectory prediction framework based on the transformer network. The semantic maps of each agent are used as inputs to convolutional networks to automatically derive relevant contextual information. A novel auxiliary loss that penalizes unfeasible off-road predictions is also proposed in this study. Experiments on the Lyft l5kit dataset show that the proposed model achieves state-of-the-art performance, substantially improving the accuracy and feasibility of the prediction outcomes.
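The abstract does not give the form of the off-road loss, so the following is only a minimal sketch of the general idea and not the authors' formulation. It assumes a binary drivable-area mask rasterized from the semantic map and a predicted trajectory already expressed in raster pixel coordinates. The names `off_road_penalty`, `drivable_mask`, and the example weight are hypothetical. The penalty here is a hard count, which is not differentiable; a training loss would need a smooth variant.

```python
import math
from typing import List, Sequence, Tuple


def off_road_penalty(
    trajectory: Sequence[Tuple[float, float]],
    drivable_mask: List[List[int]],
) -> float:
    """Fraction of predicted waypoints that fall outside the drivable area.

    Assumptions (not from the paper): `trajectory` holds (x, y) positions in
    raster pixel coordinates, and `drivable_mask[row][col]` is 1 on drivable
    cells and 0 elsewhere. Waypoints outside the raster count as off-road.
    """
    if not trajectory:
        return 0.0
    height, width = len(drivable_mask), len(drivable_mask[0])
    off_road = 0
    for x, y in trajectory:
        col, row = math.floor(x), math.floor(y)
        inside = 0 <= row < height and 0 <= col < width
        if not inside or drivable_mask[row][col] == 0:
            off_road += 1
    return off_road / len(trajectory)


# Toy raster: a two-cell-wide vertical road in a 3 x 4 grid.
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
on_road = [(1.0, 0.0), (1.2, 1.0), (2.0, 2.0)]
drifting = [(1.0, 0.0), (2.5, 1.0), (3.0, 2.0), (4.0, 2.0)]

penalty_on_road = off_road_penalty(on_road, mask)
penalty_drifting = off_road_penalty(drifting, mask)

# Hypothetical way to combine the penalty with a regression loss.
regression_loss = 0.8   # placeholder value, not a reported result
off_road_weight = 2.0   # assumed weighting hyperparameter
total_loss = regression_loss + off_road_weight * penalty_drifting
```

In this toy case the first trajectory stays on the road and receives no penalty, while the second has two of its four waypoints off the drivable cells, so the auxiliary term raises its total loss.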

Paper Structure (22 sections, 12 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Structure of Transformer Network
  • Figure 2: Frameworks of Scaled Dot-Product Attention, Multi-Head Attention, and Multi-Head Linear Attention
  • Figure 3: Examples of different BEV scene rasterization maps, including the AV (green rectangle) and TVs (blue rectangles).
  • Figure 4: Model comparison in going-straight scenes (upper three) and turning scenes (lower three). The green rectangle represents the target agent, blue rectangles represent other agents, and darker afterimages indicate the history trajectory ($K=3$, $h=1$ s, and $H=5$ s).