ModeSeq: Taming Sparse Multimodal Motion Prediction with Sequential Mode Modeling
Zikang Zhou, Hengjian Zhou, Haibo Hu, Zihao Wen, Jianping Wang, Yung-Hui Li, Yu-Kai Huang
TL;DR
ModeSeq tackles multimodal motion prediction by modeling future modes as a sequence rather than decoding many independent trajectories in one shot. It uses a single-layer mechanism built from a Memory Transformer and a Context Transformer, combined with iterative refinement and a mode rearrangement step, and it trains with an Early-Match-Take-All (EMTA) objective to improve mode coverage and confidence calibration without dense mode prediction. On the Waymo Open Motion Dataset (WOMD) and Argoverse 2, ModeSeq achieves balanced improvements in mode coverage, confidence scoring, and trajectory accuracy, and it can extrapolate to generate additional plausible modes when uncertainty is high. The approach is parameter-efficient, can be integrated with alternative scene encoders, and reduces the need for post-processing, which makes it a candidate for on-board deployment.
Abstract
Anticipating the multimodality of future events lays the foundation for safe autonomous driving. However, multimodal motion prediction for traffic agents has been clouded by the lack of multimodal ground truth. Existing works predominantly adopt the winner-take-all training strategy to tackle this challenge, yet still suffer from limited trajectory diversity and uncalibrated mode confidence. While some approaches address these limitations by generating excessive trajectory candidates, they necessitate a post-processing stage to identify the most representative modes, a process lacking universal principles and compromising trajectory accuracy. We are thus motivated to introduce ModeSeq, a new multimodal prediction paradigm that models modes as sequences. Unlike the common practice of decoding multiple plausible trajectories in one shot, ModeSeq requires motion decoders to infer the next mode step by step, thereby more explicitly capturing the correlation between modes and significantly enhancing the ability to reason about multimodality. Leveraging the inductive bias of sequential mode prediction, we also propose the Early-Match-Take-All (EMTA) training strategy to diversify the trajectories further. Without relying on dense mode prediction or heuristic post-processing, ModeSeq considerably improves the diversity of multimodal output while attaining satisfactory trajectory accuracy, resulting in balanced performance on motion prediction benchmarks. Moreover, ModeSeq naturally emerges with the capability of mode extrapolation, which supports forecasting more behavior modes when the future is highly uncertain.
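The contrast between winner-take-all and an early-match rule can be illustrated with a small sketch. The abstract does not define EMTA's matching criterion, so everything specific here is an assumption made for illustration: a mode "matches" when its endpoint lies within a hypothetical `match_threshold` of the ground-truth endpoint, the first matching mode in decoding order receives the supervision, and the rule falls back to winner-take-all when nothing matches. The function names are hypothetical and are not taken from the authors' code.

```python
import math

def endpoint_error(traj, gt):
    """Euclidean distance between the last waypoints of two trajectories."""
    (px, py), (gx, gy) = traj[-1], gt[-1]
    return math.hypot(px - gx, py - gy)

def winner_take_all_index(modes, gt):
    """Standard winner-take-all: supervise the mode closest to the ground truth."""
    errors = [endpoint_error(m, gt) for m in modes]
    return min(range(len(modes)), key=errors.__getitem__)

def early_match_index(modes, gt, match_threshold=2.0):
    """ASSUMED early-match rule: supervise the earliest mode (in decoding
    order) whose endpoint error is within match_threshold; if no mode
    matches, fall back to winner-take-all."""
    for i, m in enumerate(modes):
        if endpoint_error(m, gt) <= match_threshold:
            return i
    return winner_take_all_index(modes, gt)

# Three modes listed in the order a sequential decoder would emit them.
modes = [
    [(0.0, 0.0), (9.0, 1.0)],   # close to the ground truth
    [(0.0, 0.0), (10.0, 0.2)],  # closest to the ground truth
    [(0.0, 0.0), (0.0, 10.0)],  # a different maneuver
]
gt = [(0.0, 0.0), (10.0, 0.0)]

wta = winner_take_all_index(modes, gt)
emta = early_match_index(modes, gt)
emta_strict = early_match_index(modes, gt, match_threshold=0.1)
```

Under these assumptions, winner-take-all selects the second mode, while the early-match rule selects the first mode because it already falls within the threshold. Favoring earlier positions in this way would leave later positions in the sequence free to cover other behaviors, which is one plausible reading of how sequential ordering could help diversify the output.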
