MGTR: Multi-Granular Transformer for Motion Prediction with LiDAR

Yiqian Gan; Hao Xiao; Yizhe Zhao; Ethan Zhang; Zhe Huang; Xin Ye; Lingting Ge

TL;DR

MGTR tackles long-horizon motion prediction for heterogeneous agents by fusing multimodal context from agent histories, HD-map elements, and LiDAR-derived features within a multi-granular Transformer framework. The encoder aggregates multi-granular tokens using local attention and motion-aware token selection, while the decoder maintains an intention goal set and uses trajectory- and motion-aware context search to refine multiple future trajectories modeled with a Gaussian Mixture Model: $\text{T}_{\text{scene}} = \text{MLP}(F_e^A)$ and $\text{T}_{\text{target}} = \text{MLP}(F_d^j)$ with mixture weights $p \in \mathbb{R}^{\text{K}}$ for $\text{K}$ modes. Training optimizes a weighted sum of losses $\bigl(\mathcal{L}_{aux}, \mathcal{L}_{cls}, \mathcal{L}_{GMM}\bigr)$, enabling specialized modes and robust multimodal predictions. Empirically, MGTR achieves state-of-the-art results on the WOMD-LiDAR benchmark, notably improving mAP for pedestrians and cyclists, and demonstrating the practical value of multi-granular, LiDAR-enhanced context in real-world autonomous driving scenarios.
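The weighted sum of $\mathcal{L}_{aux}$, $\mathcal{L}_{cls}$, and $\mathcal{L}_{GMM}$ can be illustrated with a minimal pure-Python sketch. Several choices here are assumptions made for illustration and are not taken from the paper: the Gaussian components are axis-aligned (no x/y correlation), the positive mode is the one whose endpoint lies closest to the ground-truth endpoint, a plain L1 term stands in for the auxiliary loss, and the weights `w_aux`, `w_cls`, `w_gmm` and the function names are hypothetical.

```python
import math

def gaussian_nll(target, mean, sigma):
    """Negative log-likelihood of a 2D point under an axis-aligned Gaussian."""
    (tx, ty), (mx, my), (sx, sy) = target, mean, sigma
    return (math.log(2.0 * math.pi) + math.log(sx) + math.log(sy)
            + 0.5 * (((tx - mx) / sx) ** 2 + ((ty - my) / sy) ** 2))

def cross_entropy(logits, label):
    """Softmax cross-entropy over the K mode logits (numerically stabilized)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[label]

def training_loss(gt, means, sigmas, logits, aux_pred,
                  w_aux=1.0, w_cls=1.0, w_gmm=1.0):
    """Weighted sum of auxiliary, classification, and GMM terms for one agent."""
    # Assumption: hard assignment to the mode whose endpoint is nearest
    # to the ground-truth endpoint.
    best = min(range(len(means)), key=lambda k: math.dist(means[k][-1], gt[-1]))
    steps = len(gt)
    # GMM term: mean per-step NLL of the ground truth under the selected mode.
    l_gmm = sum(gaussian_nll(gt[t], means[best][t], sigmas[best][t])
                for t in range(steps)) / steps
    # Classification term: the selected mode is the target class.
    l_cls = cross_entropy(logits, best)
    # Auxiliary term: L1 stand-in for an auxiliary regression loss.
    l_aux = sum(abs(p[0] - g[0]) + abs(p[1] - g[1])
                for p, g in zip(aux_pred, gt)) / steps
    total = w_aux * l_aux + w_cls * l_cls + w_gmm * l_gmm
    return total, best, (l_aux, l_cls, l_gmm)

# Toy example: K = 2 modes, 2 future steps, mode 0 matches the ground truth.
gt = [(1.0, 0.0), (2.0, 0.0)]
means = [[(1.0, 0.0), (2.0, 0.0)], [(0.0, 1.0), (0.0, 2.0)]]
sigmas = [[(1.0, 1.0)] * 2, [(1.0, 1.0)] * 2]
total, best, (l_aux, l_cls, l_gmm) = training_loss(
    gt, means, sigmas, logits=[0.0, 0.0], aux_pred=gt)
```

In this toy case the selected mode is index 0, the auxiliary term is zero, the classification term equals $\ln 2$ (uniform logits over two modes), and the GMM term equals $\ln 2\pi$ (zero residual, unit standard deviations).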

Abstract

Motion prediction has been an essential component of autonomous driving systems since it handles highly uncertain and complex scenarios involving moving agents of different types. In this paper, we propose a Multi-Granular TRansformer (MGTR) framework, an encoder-decoder network that exploits context features at different granularities for different kinds of traffic agents. To further enhance MGTR's capabilities, we leverage LiDAR point cloud data by incorporating LiDAR semantic features from an off-the-shelf LiDAR feature extractor. We evaluate MGTR on the Waymo Open Dataset motion prediction benchmark and show that the proposed method achieves state-of-the-art performance, ranking 1st on its leaderboard (https://waymo.com/open/challenges/2023/motion-prediction/).

Paper Structure (27 sections, 8 equations, 3 figures, 3 tables)

This paper contains 27 sections, 8 equations, 3 figures, 3 tables.

Introduction
Related work
Method
Multimodal Multi-Granular Inputs
Agent and map
LiDAR
Motion-aware context search
Transformer Encoder
Token aggregation and encoding
Future state enhancement
Transformer Decoder
Intention goal set
Token aggregation with intention goal set
Multimodal motion prediction with GMM
Training Loss
...and 12 more sections

Figures (3)

Figure 1: Comparing context information used in different motion prediction frameworks. Most previous methods [varadarajan2022multipath++, shi2022motion] encode the road graph at only a single granularity for all agents in the scene (green dashed box). In our method, various agents can benefit from multi-granular context information encoded from multimodal sources (blue dashed box).
Figure 2: An overview of our proposed MGTR. Agent trajectories and map elements are represented as polylines and encoded as agent and multi-granular map tokens. LiDAR data is processed by a pre-trained model into voxel features and further transformed into multi-granular LiDAR tokens. Motion-aware context search selects a set of map and LiDAR tokens, refined together with agent tokens through local self-attention in the Transformer encoder. Finally, a set of intention goals and their corresponding content features are sent to the decoder to aggregate context features. Multiple future trajectories of each agent will be predicted based on its intention goals, supporting the multimodal nature of agent behaviors.
Figure 3: Visual comparison of prediction results from MTR [shi2023mtr] and MGTR (ours). Each scene shows a global bird's-eye view (agents, HD map, and LiDAR point cloud) and a local LiDAR visualization. For clarity, only a limited set of LiDAR semantic classes is shown, such as vegetation (green points), building (cyan points), sidewalk (brown points), vehicle (orange points), and pedestrian (blue points).
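
The motion-aware context search mentioned in the Figure 2 caption can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: it assumes the search query is the agent's position extrapolated at constant velocity over a hypothetical horizon `horizon_s`, and that the nearest `k` token centers are kept at each granularity. All function and parameter names are made up for the example.

```python
import math

def motion_aware_search(agent_pos, agent_vel, token_centers, k, horizon_s=4.0):
    """Return indices of the k token centers nearest to the projected position."""
    # Assumption: constant-velocity extrapolation over `horizon_s` seconds.
    qx = agent_pos[0] + agent_vel[0] * horizon_s
    qy = agent_pos[1] + agent_vel[1] * horizon_s
    # Rank tokens by Euclidean distance to the projected query point.
    order = sorted(
        range(len(token_centers)),
        key=lambda i: math.hypot(token_centers[i][0] - qx,
                                 token_centers[i][1] - qy),
    )
    return order[:k]

def select_multi_granular(agent_pos, agent_vel, tokens_by_level, k_by_level):
    """Run the same search independently on each granularity's token set."""
    return {
        level: motion_aware_search(agent_pos, agent_vel, centers, k_by_level[level])
        for level, centers in tokens_by_level.items()
    }

# Toy scene: one coarse and one fine set of (x, y) token centers, in meters.
tokens = {
    "coarse": [(0.0, 0.0), (19.0, 0.0), (21.0, 1.0), (40.0, 0.0)],
    "fine": [(0.0, 0.0), (5.0, 0.0), (20.0, 0.0)],
}
moving = select_multi_granular((0.0, 0.0), (5.0, 0.0), tokens,
                               {"coarse": 2, "fine": 1})
parked = select_multi_granular((0.0, 0.0), (0.0, 0.0), tokens,
                               {"coarse": 2, "fine": 1})
```

With these inputs the moving agent (5 m/s along x, projected to x = 20 m) selects coarse tokens 1 and 2 and fine token 2, while the stationary agent selects coarse tokens 0 and 1 and fine token 0. The point of the sketch is that a fast agent draws context from farther ahead than a slow or stationary one.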
