
Towards Consistent and Explainable Motion Prediction using Heterogeneous Graph Attention

Tobias Demmler, Andreas Tamke, Thao Dang, Karsten Haug, Lars Mikelsons

TL;DR

The paper addresses a common failure in trajectory prediction for autonomous driving: predicted trajectories drift away from the actual road lanes because conventional encoders lose exact location information. It introduces a refinement module that projects predicted trajectories back onto the HD map, and a unified scene encoder built on a heterogeneous graph attention network (HGAT) that captures all relations in a single graph and enables explainability through attention analysis. On Argoverse 2, these two contributions improve trajectory consistency and accuracy, with end-to-end training yielding substantial gains and attention analysis offering transparency into the model's decision-making. The approach is practical and adaptable: it can be integrated into existing motion forecasting systems to improve map consistency and interpretability.

Abstract

In autonomous driving, accurately interpreting the movements of other road users and leveraging this knowledge to forecast future trajectories is crucial. This is typically achieved through the integration of map data and tracked trajectories of various agents. Many approaches combine this information into a single embedding for each agent, which is then used to predict future behavior. However, these approaches have a notable drawback: they may lose exact location information during the encoding process. Although the encoding still includes general map information, the generation of valid and consistent trajectories is not guaranteed, which can cause the predicted trajectories to stray from the actual lanes. This paper introduces a new refinement module designed to project the predicted trajectories back onto the actual map, rectifying these discrepancies and leading towards more consistent predictions. This versatile module can be readily incorporated into a wide range of architectures. Additionally, we propose a novel scene encoder that handles all relations between agents and their environment in a single unified heterogeneous graph attention network. By analyzing the attention values on the different edges in this graph, we can gain unique insights into the neural network's inner workings, leading towards a more explainable prediction.

Paper Structure

This paper contains 40 sections, 6 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Overview of our heterogeneous graph. There are three types of nodes. The lane nodes represent the map. The trajectory-step nodes represent positional information of trajectories at a specific point in time. The full-trajectory nodes accumulate the information of the corresponding trajectory-step nodes and distribute the combined information back. Lane nodes interact with other lane nodes through the left, right, predecessor and successor edges. Information between lane nodes and trajectory-step nodes is exchanged through the lane-step and step-lane edges. Information between two agents at a specific timestep is exchanged via the step-step edges.
  • Figure 2: Overview of what the heterogeneous graph structure looks like in an example scene. (a) visualizes the interaction between lane nodes. Green arrow: left neighbor, red arrows: predecessor nodes, blue arrows: successor nodes. (b) demonstrates the connections between trajectory-step nodes (depicted as cars) and lane nodes. Trajectory-step nodes are connected to nearby lane nodes with a similar orientation. (c) shows the interaction between different trajectory steps. Only trajectory steps at the same point in time and within a certain distance threshold can interact with each other. $t_i$ indicates the timestep, and the color of the node indicates the agent. (d) Each observed trajectory has its own full-trajectory node. This node accumulates the information of the corresponding trajectory-step nodes and distributes the combined information back to the individual trajectory steps.
  • Figure 3: The network architecture starts by encoding varying-sized input features into uniform-sized feature vectors. Map features are encoded with a GNN, followed by encoding the whole scene. Trajectory node features are extracted and merged with residual trajectory features for final feature encoding. These final features are processed by independent prediction heads to create trajectory proposals. If no refinement module is employed, the confidence module ranks the predicted trajectories. If a refinement module is used, it takes in trajectory proposals, final feature vectors and encoded lane nodes to predict and rank the final trajectories.
  • Figure 4: Overview of the refinement module. It leverages parts of the same heterogeneous graph. 1) Map information is transported to the predicted trajectory steps. 2) Corresponding trajectory steps are accumulated in the full-trajectory nodes. 3) The accumulated information is distributed back to the individual trajectory-step nodes. 4) The refinement network modifies the current trajectory prediction. In the end, the trajectories are rated in the confidence network.
  • Figure 5: (a) Without the refinement module, the exact lane boundaries are lost and the prediction drifts into the other lane. (b) With the refinement module, all the map information can be used to shift the trajectories toward the actual lane. Blue line: observed history, red line: ground truth future trajectory, green line: predicted trajectory with highest confidence, orange lines: other predictions.
  • ...and 1 more figure
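
The connectivity rules described in the captions of Figures 1 and 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class names, the `build_edges` function, the threshold values (`lane_radius`, `max_heading_diff`, `agent_radius`) and the edge names `step-full` and `full-step` are assumptions made for illustration; only the `lane-step`, `step-lane` and `step-step` edge names come from the captions. Lane-to-lane edges (left, right, predecessor, successor) are assumed to come directly from the map topology and are omitted here.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LaneNode:
    x: float
    y: float
    heading: float  # radians


@dataclass
class StepNode:
    agent: int  # assumption: agent id doubles as the full-trajectory node index
    t: int      # timestep index
    x: float
    y: float
    heading: float  # radians


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def build_edges(lanes, steps, lane_radius=5.0,
                max_heading_diff=math.pi / 4, agent_radius=30.0):
    """Return {edge_type: [(source_index, target_index), ...]}.

    All thresholds are hypothetical values chosen for this sketch.
    """
    edges = defaultdict(list)
    for s_idx, s in enumerate(steps):
        # Each trajectory step exchanges information with the
        # full-trajectory node of its own agent (both directions).
        edges["step-full"].append((s_idx, s.agent))
        edges["full-step"].append((s.agent, s_idx))

        # Steps connect to nearby lane nodes with a similar orientation.
        for l_idx, lane in enumerate(lanes):
            close = math.hypot(s.x - lane.x, s.y - lane.y) <= lane_radius
            aligned = abs(wrap_angle(s.heading - lane.heading)) <= max_heading_diff
            if close and aligned:
                edges["lane-step"].append((l_idx, s_idx))
                edges["step-lane"].append((s_idx, l_idx))

        # Steps of different agents interact only at the same timestep
        # and within a distance threshold.
        for o_idx, o in enumerate(steps):
            if o.agent == s.agent or o.t != s.t:
                continue
            if math.hypot(s.x - o.x, s.y - o.y) <= agent_radius:
                edges["step-step"].append((o_idx, s_idx))
    return dict(edges)


if __name__ == "__main__":
    lanes = [LaneNode(0.0, 0.0, 0.0), LaneNode(0.0, 3.5, math.pi)]
    steps = [StepNode(0, 0, 1.0, 0.2, 0.05), StepNode(1, 0, 10.0, 0.0, 0.0)]
    for edge_type, pairs in build_edges(lanes, steps).items():
        print(edge_type, pairs)
```

In this sketch the heading check keeps a vehicle from being linked to a nearby lane running in the opposite direction, which matches the "similar orientation" condition in Figure 2(b). The brute-force pairwise loops are chosen for readability; a spatial index would be the usual choice for larger scenes.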