MGTR: Multi-Granular Transformer for Motion Prediction with LiDAR
Yiqian Gan, Hao Xiao, Yizhe Zhao, Ethan Zhang, Zhe Huang, Xin Ye, Lingting Ge
TL;DR
MGTR tackles long-horizon motion prediction for heterogeneous agents by fusing multimodal context from agent histories, HD-map elements, and LiDAR-derived features within a multi-granular Transformer framework. The encoder aggregates multi-granular tokens using local attention and motion-aware token selection, while the decoder maintains an intention goal set and uses trajectory- and motion-aware context search to refine multiple future trajectories modeled with a Gaussian Mixture Model: $\text{T}_{\text{scene}} = \text{MLP}(F_e^A)$ and $\text{T}_{\text{target}} = \text{MLP}(F_d^j)$ with mixture weights $p \in \mathbb{R}^{\text{K}}$ for $\text{K}$ modes. Training optimizes a weighted sum of losses $\bigl(\mathcal{L}_{aux}, \mathcal{L}_{cls}, \mathcal{L}_{GMM}\bigr)$, enabling specialized modes and robust multimodal predictions. Empirically, MGTR achieves state-of-the-art results on the WOMD-LiDAR benchmark, notably improving mAP for pedestrians and cyclists, and demonstrating the practical value of multi-granular, LiDAR-enhanced context in real-world autonomous driving scenarios.
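To make the loss structure concrete, the following is a minimal Python sketch of how a classification term, a GMM term, and an auxiliary term could be combined into one weighted objective for a single agent at a single time step. It is an illustration under stated assumptions, not the authors' implementation: the function names, the axis-aligned Gaussian, the rule that the supervised mode is the one whose mean lies closest to the ground truth, and the default weights of 1.0 are all hypothetical choices made here for clarity.

```python
import math


def softmax(logits):
    """Convert K mode logits into mixture weights p that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_nll_2d(gt, mean, sigma):
    """Negative log-likelihood of a 2D point under an axis-aligned Gaussian.

    Assumption: no correlation term between x and y.
    """
    zx = (gt[0] - mean[0]) / sigma[0]
    zy = (gt[1] - mean[1]) / sigma[1]
    return math.log(2.0 * math.pi * sigma[0] * sigma[1]) + 0.5 * (zx * zx + zy * zy)


def mode_losses(logits, means, sigmas, gt):
    """Return (l_cls, l_gmm) for one time step.

    Assumption: the mode whose mean is closest to the ground truth is
    treated as the positive mode (hard assignment).
    """
    dists = [math.hypot(mu[0] - gt[0], mu[1] - gt[1]) for mu in means]
    best = dists.index(min(dists))
    p = softmax(logits)
    l_cls = -math.log(p[best])  # cross-entropy on the selected mode
    l_gmm = gaussian_nll_2d(gt, means[best], sigmas[best])
    return l_cls, l_gmm


def total_loss(l_aux, l_cls, l_gmm, w_aux=1.0, w_cls=1.0, w_gmm=1.0):
    """Weighted sum of the three loss terms; the weights are hypothetical."""
    return w_aux * l_aux + w_cls * l_cls + w_gmm * l_gmm


if __name__ == "__main__":
    logits = [0.0, 0.0, 0.0]  # K = 3 modes, uniform confidence
    means = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
    sigmas = [(1.0, 1.0)] * 3
    l_cls, l_gmm = mode_losses(logits, means, sigmas, gt=(1.0, 1.0))
    print(total_loss(l_aux=0.0, l_cls=l_cls, l_gmm=l_gmm))
```

In a full training loop the GMM term would be accumulated over all future time steps and agents; the single-step form is used here only to keep the example short.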
Abstract
Motion prediction has been an essential component of autonomous driving systems since it handles highly uncertain and complex scenarios involving moving agents of different types. In this paper, we propose a Multi-Granular TRansformer (MGTR) framework, an encoder-decoder network that exploits context features at different granularities for different kinds of traffic agents. To further enhance MGTR's capabilities, we leverage LiDAR point cloud data by incorporating LiDAR semantic features from an off-the-shelf LiDAR feature extractor. We evaluate MGTR on the Waymo Open Dataset motion prediction benchmark and show that the proposed method achieves state-of-the-art performance, ranking 1st on its leaderboard (https://waymo.com/open/challenges/2023/motion-prediction/).
