Trajectory Prediction for Autonomous Driving based on Multi-Head Attention with Joint Agent-Map Representation
Kaouther Messaoud, Nachiket Deo, Mohan M. Trivedi, Fawzi Nashashibi
TL;DR
The paper tackles multimodal trajectory prediction for autonomous driving with MHA-JAM, a multi-head attention model that operates on a joint agent-map representation. Each attention head specializes in a different future mode, so the model outputs diverse, scene-consistent trajectories as a mixture model with per-mode parameters $\Theta_l^t$ and mode probabilities $P_l$. The method achieves state-of-the-art results on the nuScenes prediction benchmark on metrics such as MinADE$_k$ and MissRate$_{k,d}$, and an off-road loss further encourages predictions to stay within the drivable area. Overall, the work shows that a joint spatio-temporal context combined with head-specific multimodal reasoning improves both the accuracy and the drivable-area compliance of trajectory forecasts in urban driving scenes.
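For readers unfamiliar with the two metrics named above, here is a minimal sketch of how they are commonly computed for multimodal predictions. It assumes the usual benchmark-style definitions: MinADE$_k$ is the smallest average displacement error among the $k$ predicted trajectories, and a sample counts as a miss when none of the $k$ predictions stays within $d$ meters of the ground truth at every time step. The function names, array shapes, and default threshold are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def min_ade_k(preds, gt):
    """Smallest average displacement error over k predictions.

    preds: array of shape (k, T, 2), the k predicted trajectories.
    gt:    array of shape (T, 2), the ground-truth trajectory.
    """
    dists = np.linalg.norm(preds - gt[None], axis=-1)  # (k, T) pointwise L2 errors
    return float(dists.mean(axis=1).min())

def is_miss_k_d(preds, gt, d=2.0):
    """True if every one of the k predictions deviates by more than d meters
    from the ground truth at some time step (assumed definition)."""
    dists = np.linalg.norm(preds - gt[None], axis=-1)  # (k, T)
    return bool(dists.max(axis=1).min() > d)

def miss_rate_k_d(samples, d=2.0):
    """Fraction of (preds, gt) samples that are misses."""
    return float(np.mean([is_miss_k_d(p, g, d) for p, g in samples]))
```

Taking the minimum over the $k$ modes rewards a model for covering the true outcome with at least one prediction, which is why these metrics suit multimodal predictors.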
Abstract
Predicting the trajectories of surrounding agents is an essential ability for autonomous vehicles navigating complex traffic scenes. The future trajectories of agents can be inferred from two important cues: the locations and past motion of agents, and the static scene structure. Because scene structure and agent configurations vary widely, prior work has applied the attention mechanism separately to the scene and to the agent configuration, to learn the most salient parts of each cue. However, the two cues are tightly linked. The agent configuration can indicate which part of the scene is most relevant to prediction, and the static scene can in turn help determine the relative influence of agents on each other's motion. Moreover, the distribution of future trajectories is multimodal, with modes corresponding to the agent's intent, and that intent also determines which parts of the scene and agent configuration are relevant to prediction. We thus propose a novel approach that applies multi-head attention to a joint representation of the static scene and surrounding agents. To address the multimodality of future trajectories, we use each attention head to generate a distinct future trajectory. Our model achieves state-of-the-art results on the nuScenes prediction benchmark and generates diverse future trajectories that comply with scene structure and agent configuration.
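The idea of one attention head per future mode can be illustrated with a short NumPy sketch. It is not the authors' implementation. It assumes that agent and map features live on a shared spatial grid and are concatenated channel-wise to form the joint representation, that the target agent's encoding provides the query, and that each head's context vector is decoded by a single linear layer into one trajectory plus a score. All names (`init_params`, `predict_modes`, `Wq`, `Wk`, `Wv`, `Wdec`, `wprob`) and dimensions are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, n_heads, enc_dim, joint_dim, d, horizon):
    """Random (untrained) weights for each head; purely illustrative."""
    return [{
        "Wq": rng.normal(size=(enc_dim, d)),        # query projection
        "Wk": rng.normal(size=(joint_dim, d)),      # key projection
        "Wv": rng.normal(size=(joint_dim, d)),      # value projection
        "Wdec": rng.normal(size=(d + enc_dim, horizon * 2)),  # trajectory decoder
        "wprob": rng.normal(size=(d + enc_dim,)),   # mode score
    } for _ in range(n_heads)]

def predict_modes(target_enc, agent_grid, map_grid, heads, horizon):
    """Return one trajectory per head and a probability per head.

    target_enc: (enc_dim,) encoding of the target agent's past motion.
    agent_grid: (M, N, Ca) surrounding-agent features on a spatial grid.
    map_grid:   (M, N, Cm) static-scene features on the same grid.
    """
    # Joint agent-map representation: channel-wise concatenation per grid cell.
    joint = np.concatenate([agent_grid, map_grid], axis=-1)
    cells = joint.reshape(-1, joint.shape[-1])       # (M*N, Ca + Cm)

    trajs, scores = [], []
    for h in heads:
        q = target_enc @ h["Wq"]                     # (d,)
        keys = cells @ h["Wk"]                       # (M*N, d)
        values = cells @ h["Wv"]                     # (M*N, d)
        attn = softmax(keys @ q / np.sqrt(q.shape[0]))  # weights over grid cells
        context = attn @ values                      # (d,) head-specific context
        z = np.concatenate([context, target_enc])
        trajs.append((z @ h["Wdec"]).reshape(horizon, 2))
        scores.append(z @ h["wprob"])
    return np.stack(trajs), softmax(np.array(scores))
```

Because every head attends over the same joint grid but with its own projections, each head can weight different scene regions and agents, which is the mechanism the abstract relies on to produce distinct modes.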
