Trajectory Prediction for Autonomous Driving based on Multi-Head Attention with Joint Agent-Map Representation

Kaouther Messaoud, Nachiket Deo, Mohan M. Trivedi, Fawzi Nashashibi

TL;DR

The paper tackles multimodal trajectory prediction for autonomous driving by introducing MHA-JAM, a multi-head attention model that operates on a joint agent-map representation. Each attention head specializes in a different future mode, so the model outputs diverse, scene-consistent trajectories as a mixture model with parameters $\Theta_l^t$ and component probabilities $P_l$. The method achieves state-of-the-art results on the nuScenes prediction benchmark on metrics such as MinADE$_k$ and MissRate$_{k,d}$, and it benefits from an auxiliary off-road loss that encourages predictions to stay within the drivable area. Overall, the work shows that a joint spatio-temporal context and head-specific multimodal reasoning improve both the accuracy and the scene compliance of trajectory forecasts in urban driving scenarios.
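The two metrics named above are not defined in this summary. As a rough illustration, the sketch below follows their common usage on the nuScenes prediction benchmark, which is an assumption here: MinADE$_k$ is taken as the lowest average displacement error among the $k$ most probable predicted trajectories, and a sample counts as a miss for MissRate$_{k,d}$ when every one of those $k$ trajectories deviates from the ground truth by more than $d$ metres at some time step. All function names and the toy trajectories are hypothetical.

```python
import math

def displacement_errors(pred, truth):
    """Pointwise L2 distances between a predicted and a ground-truth trajectory."""
    return [math.dist(p, g) for p, g in zip(pred, truth)]

def top_k_indices(probs, k):
    """Indices of the k most probable modes."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def min_ade_k(preds, probs, truth, k):
    """Lowest average displacement error among the k likeliest predictions."""
    ades = []
    for i in top_k_indices(probs, k):
        errors = displacement_errors(preds[i], truth)
        ades.append(sum(errors) / len(errors))
    return min(ades)

def is_miss(preds, probs, truth, k, d):
    """True if each of the k likeliest predictions strays more than d at some step."""
    return all(
        max(displacement_errors(preds[i], truth)) > d
        for i in top_k_indices(probs, k)
    )

def miss_rate(samples, k, d):
    """Fraction of (preds, probs, truth) samples that are misses."""
    return sum(is_miss(p, q, t, k, d) for p, q, t in samples) / len(samples)

# Toy sample: the likeliest mode veers away, the second mode stays close.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
preds = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 3.0)],  # probability 0.7
    [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)],  # probability 0.3
]
probs = [0.7, 0.3]

ade_1 = min_ade_k(preds, probs, truth, k=1)
ade_2 = min_ade_k(preds, probs, truth, k=2)
miss_1 = is_miss(preds, probs, truth, k=1, d=2.0)
miss_2 = is_miss(preds, probs, truth, k=2, d=2.0)
```

In this toy sample, allowing two modes instead of one lowers the error and turns the miss into a hit, which is why both metrics are reported as a function of $k$.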

Abstract

Predicting the trajectories of surrounding agents is an essential ability for autonomous vehicles navigating through complex traffic scenes. The future trajectories of agents can be inferred using two important cues: the locations and past motion of agents, and the static scene structure. Due to the high variability in scene structure and agent configurations, prior work has employed the attention mechanism, applied separately to the scene and the agent configuration, to learn the most salient parts of both cues. However, the two cues are tightly linked. The agent configuration can inform what part of the scene is most relevant to prediction. The static scene in turn can help determine the relative influence of agents on each other's motion. Moreover, the distribution of future trajectories is multimodal, with modes corresponding to the agent's intent. The agent's intent also informs what part of the scene and agent configuration is relevant to prediction. We thus propose a novel approach that applies multi-head attention to a joint representation of the static scene and surrounding agents. We use each attention head to generate a distinct future trajectory to address the multimodality of future trajectories. Our model achieves state-of-the-art results on the nuScenes prediction benchmark and generates diverse future trajectories compliant with scene structure and agent configuration.

Paper Structure

This paper contains 18 sections, 16 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: MHA-JAM (MHA with Joint Agent Map representation): Each LSTM encoder generates an encoding vector of the recent motion of one surrounding agent. The CNN backbone transforms the input map image into a 3D tensor of scene features. A combined representation of the context is built by concatenating the surrounding agents' motion encodings and the scene features. Each attention head models a possible way of interaction between the target (green car) and the combined context features. Each LSTM decoder receives a context vector and the target vehicle encoding and generates a distribution over a possible predicted trajectory conditioned on that context.
  • Figure 2: Attention modules in MHA-JAM: We generate keys and values by applying 1x1 convolutional layers to a joint representation of the map and surrounding agents, while the trajectory encoding of the target agent serves as the query.
  • Figure 3: Ablation experiments: Through ablation experiments, we evaluate the importance of input cues (top), the effectiveness of a joint agent-map representation for generating keys and values for attention heads (middle), the effectiveness of attention heads specialized for particular modes of the multimodal predictive distribution (middle), and finally the effectiveness of the auxiliary off-road loss (bottom). For each experiment we plot the metrics MinADE$_k$ (left), MissRate$_{k,2}$ (middle) and off-road rate (right) for the $k$ likeliest trajectories output by the models.
  • Figure 4: MHA with separate agent-map representation: We compare our model to a baseline where attention weights are generated separately for the map and agent features.
  • Figure 5: Examples of attention maps and trajectories produced by the MHA-JAM (off-road) model.
  • ...and 1 more figure
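
The attention module described in the Figure 1 and Figure 2 captions can be illustrated with a minimal sketch. It is not the authors' implementation. Several details are assumptions that the captions do not state: the scaled dot-product scoring with a softmax over grid cells, placing each agent encoding in the grid cell the agent occupies, the toy dimensions, and the random weights standing in for learned parameters. A 1x1 convolution is modelled as the same linear map applied independently to every cell of the flattened grid.

```python
import math
import random

def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(target_enc, joint_cells, w_q, w_k, w_v):
    """One head: query from the target encoding, keys and values from joint cells."""
    query = linear(w_q, target_enc)
    keys = [linear(w_k, cell) for cell in joint_cells]    # 1x1 conv as per-cell map
    values = [linear(w_v, cell) for cell in joint_cells]  # 1x1 conv as per-cell map
    scale = math.sqrt(len(query))  # assumed scaled dot-product scoring
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    context = [
        sum(a * v[j] for a, v in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return weights, context

def random_matrix(rng, rows, cols):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy sizes, chosen only for illustration.
MAP_CH, AGENT_CH, TARGET_DIM, HEAD_DIM, NUM_HEADS, GRID_CELLS = 4, 3, 5, 2, 3, 6
rng = random.Random(0)

# Scene features from the CNN backbone, flattened to one vector per grid cell.
map_feats = [[rng.uniform(-1.0, 1.0) for _ in range(MAP_CH)] for _ in range(GRID_CELLS)]
# Agent motion encodings: zeros except in the cell a surrounding agent occupies.
agent_feats = [[0.0] * AGENT_CH for _ in range(GRID_CELLS)]
agent_feats[2] = [rng.uniform(-1.0, 1.0) for _ in range(AGENT_CH)]
# Joint agent-map representation: channel-wise concatenation per cell.
joint_cells = [m + a for m, a in zip(map_feats, agent_feats)]
target_enc = [rng.uniform(-1.0, 1.0) for _ in range(TARGET_DIM)]

# Each head has its own weights and yields its own context vector.
heads = []
for _ in range(NUM_HEADS):
    w_q = random_matrix(rng, HEAD_DIM, TARGET_DIM)
    w_k = random_matrix(rng, HEAD_DIM, MAP_CH + AGENT_CH)
    w_v = random_matrix(rng, HEAD_DIM, MAP_CH + AGENT_CH)
    heads.append(attention_head(target_enc, joint_cells, w_q, w_k, w_v))
```

Each entry of `heads` holds one attention map over the grid and one context vector. Following the Figure 1 caption, each context vector would be passed with the target encoding to its own LSTM decoder, so the number of heads matches the number of predicted trajectory modes.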