LMFormer: Lane based Motion Prediction Transformer

Harsh Yadav, Maximilian Schaefer, Kun Zhao, Tobias Meisen

TL;DR

LMFormer tackles lane-aware trajectory prediction by embedding lane connectivity into a transformer through lane-aware cross-attention and a GNN-like map encoder. It introduces iterative refinement across stacked transformer layers and learnable mode queries to produce diverse, lane-consistent trajectories while maintaining computational efficiency. Evaluations on nuScenes and Deep Scenario show state-of-the-art results and cross-dataset generalization when trained on combined data, highlighting the value of explicit lane topology and scalable training. The work emphasizes explainability via attention maps and identifies areas for future improvement in velocity profiling and maneuver diversity.

Abstract

Motion prediction plays an important role in autonomous driving. This study presents LMFormer, a lane-aware transformer network for trajectory prediction tasks. In contrast to previous studies, our work provides a simple mechanism to dynamically prioritize the lanes and shows that such a mechanism introduces explainability into the learning behavior of the network. Additionally, LMFormer uses the lane connection information at intersections, lane merges, and lane splits in order to learn long-range dependencies in the lane structure. Moreover, we address the issue of refining the predicted trajectories and propose an efficient method for iterative refinement through stacked transformer layers. For benchmarking, we evaluate LMFormer on the nuScenes dataset and demonstrate that it achieves state-of-the-art (SOTA) performance across multiple metrics. Furthermore, the Deep Scenario dataset is used to illustrate not only the cross-dataset performance of the network but also the unification capabilities of LMFormer, which can be trained on multiple datasets to achieve better performance.

Paper Structure

This paper contains 16 sections, 4 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: A schematic of predicted trajectories (green) with mode probabilities for the target agent. In (a), the initial predictions exhibit noticeable misalignment with the drivable lanes. However, through iterative refinement steps, the trajectories become increasingly structured, as seen in (b) and (c). Notably, the final predictions not only align well with the road topology but also exhibit a realistic distribution of possible future paths.
  • Figure 2: An illustration of the LMFormer architecture. We employ a transformer-based encoder-decoder architecture to generate multiple scene-consistent trajectories for all the dynamic agents. Notably, the static context consists only of lane segments.
  • Figure 3: The encoder receives Static and Dynamic context as input and outputs Lane and Agent Encodings. The encoder is divided into two parts: the Map Encoder and the Agent Encoder. The Map Encoder models the long-range interaction among the lane segments. The Agent Encoder incorporates the interaction with all the surrounding static and dynamic elements into each agent's latent embedding. The attention mechanisms are illustrated by green (self-attention) and blue (cross-attention) arrows, where the arrowheads point toward the queries and the tails point away from the keys/values. The encoders repeat the interaction modeling N times to learn complex interactions; the weights are not shared across layers.
  • Figure 4: A depiction of the decoder architecture. The decoder performs cross-attention between learnable mode queries and Scene Encodings (keys/values). The cross-attention layer is stacked N times, and the intermediate output queries as well as the final output queries are transformed into trajectories with an MLP. Thus we obtain N trajectories for each mode of every agent. During training, all N trajectories are supervised against the ground truth, while during inference only the final layer output is generated. Importantly, the weights of the stacked cross-attention layers are not shared, while those of the MLP layers are.
  • Figure 5: An illustration of predicted trajectories (green) with mode probabilities for the target agent at the final output layer. These samples are selected from the 100 worst predictions based on minADE_5.
  • ...and 2 more figures
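
The decoder described in the Figure 4 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation. It assumes a single agent, single-head attention, a residual query update, and a single linear layer standing in for the shared MLP; all names (`cross_attention`, `decode`, `head_w`, and the chosen dimensions) are hypothetical. It shows the two properties the caption states: each stacked layer has its own attention weights while the trajectory head is shared, and every layer's output is decoded during training while only the final layer's output is decoded at inference.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, scene, w_q, w_k, w_v):
    """Single-head cross-attention: mode queries attend to scene encodings."""
    q = queries @ w_q                          # (modes, d)
    k = scene @ w_k                            # (scene_tokens, d)
    v = scene @ w_v                            # (scene_tokens, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (modes, scene_tokens)
    weights = softmax(scores, axis=-1)
    # Residual update of the queries (an assumption of this sketch).
    return queries + weights @ v


def decode(mode_queries, scene, layers, head_w, head_b, horizon, training):
    """Run the stacked layers and decode trajectories with one shared head.

    Returns one (modes, horizon, 2) array per decoded layer: all layers
    when training, only the final layer otherwise.
    """
    outputs = []
    queries = mode_queries
    last = len(layers) - 1
    for i, (w_q, w_k, w_v) in enumerate(layers):   # per-layer weights
        queries = cross_attention(queries, scene, w_q, w_k, w_v)
        if training or i == last:
            flat = queries @ head_w + head_b       # shared trajectory head
            outputs.append(flat.reshape(-1, horizon, 2))
    return outputs


# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
d, num_modes, num_scene, num_layers, horizon = 16, 5, 12, 3, 6

mode_queries = rng.normal(size=(num_modes, d))     # learnable in practice
scene = rng.normal(size=(num_scene, d))            # lane and agent encodings
layers = [
    tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
    for _ in range(num_layers)
]
head_w = rng.normal(scale=0.1, size=(d, horizon * 2))
head_b = np.zeros(horizon * 2)

train_out = decode(mode_queries, scene, layers, head_w, head_b, horizon, True)
infer_out = decode(mode_queries, scene, layers, head_w, head_b, horizon, False)
```

In this sketch, `train_out` holds one set of trajectories per layer, so a loss could be applied to each of them, whereas `infer_out` holds only the final refinement. Sharing the head means every layer's queries are mapped into the same trajectory space.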