IMPACT: Behavioral Intention-aware Multimodal Trajectory Prediction with Adaptive Context Trimming

Jiawei Sun, Xibin Yue, Jiahui Li, Tianle Shen, Chengran Yuan, Shuo Sun, Sheng Guo, Quanyun Zhou, Marcelo H Ang

TL;DR

IMPACT jointly predicts the behavioral intentions and future trajectories of surrounding agents in autonomous driving. It uses a unified model with a shared context encoder and dual context filters that prune irrelevant agents and map polylines based on predicted intentions and vectorized occupancy, and it adds an automatic labeling approach that supplies intention labels for large datasets. The method ranks first among LiDAR-free methods on the Waymo Motion Dataset and first on the Waymo Interactive Prediction Dataset, and it has been deployed on real vehicles with a design that reduces computation without sacrificing accuracy. Overall, IMPACT improves the interpretability, efficiency, and robustness of motion prediction, supporting safer and more reliable autonomous planning.

Abstract

While most prior research has focused on improving the precision of multimodal trajectory predictions, the explicit modeling of multimodal behavioral intentions (e.g., yielding, overtaking) remains relatively underexplored. This paper proposes a unified framework that jointly predicts both behavioral intentions and trajectories to enhance prediction accuracy, interpretability, and efficiency. Specifically, we employ a shared context encoder for both intention and trajectory predictions, thereby reducing structural redundancy and information loss. Moreover, we address the lack of ground-truth behavioral intention labels in mainstream datasets (Waymo, Argoverse) by auto-labeling these datasets, thus advancing the community's efforts in this direction. We further introduce a vectorized occupancy prediction module that infers the probability of each map polyline being occupied by the target vehicle's future trajectory. By leveraging these intention and occupancy prediction priors, our method conducts dynamic, modality-dependent pruning of irrelevant agents and map polylines in the decoding stage, effectively reducing computational overhead and mitigating noise from non-critical elements. Our approach ranks first among LiDAR-free methods on the Waymo Motion Dataset and achieves first place on the Waymo Interactive Prediction Dataset. Remarkably, even without model ensembling, our single-model framework improves the soft mean average precision (softmAP) by 10 percent compared to the second-best method in the Waymo Interactive Prediction Leaderboard. Furthermore, the proposed framework has been successfully deployed on real vehicles, demonstrating its practical effectiveness in real-world applications.
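The modality-dependent pruning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the filter criteria, so the top-k selection rule, the per-mode `agent_relevance` score (a stand-in for whatever the intention predictor provides), and all function and parameter names below are assumptions chosen for illustration only.

```python
from typing import Dict, List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, in ascending index order."""
    k = max(0, min(k, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prune_context(
    polyline_occupancy: Sequence[Sequence[float]],
    agent_relevance: Sequence[Sequence[float]],
    keep_polylines: int,
    keep_agents: int,
) -> List[Dict[str, List[int]]]:
    """Select, separately for each mode, the context elements passed to the decoder.

    polyline_occupancy[m][p]: assumed probability that map polyline p is occupied
        by the target's future trajectory under mode m.
    agent_relevance[m][a]: assumed relevance of agent a under mode m
        (hypothetical score, standing in for an intention-based prior).
    """
    if len(polyline_occupancy) != len(agent_relevance):
        raise ValueError("both inputs must cover the same number of modes")
    kept = []
    for occupancy, relevance in zip(polyline_occupancy, agent_relevance):
        kept.append(
            {
                "polylines": top_k_indices(occupancy, keep_polylines),
                "agents": top_k_indices(relevance, keep_agents),
            }
        )
    return kept


# Two modes, four map polylines, three surrounding agents (made-up scores).
occupancy = [[0.9, 0.1, 0.6, 0.05], [0.2, 0.8, 0.1, 0.7]]
relevance = [[0.3, 0.95, 0.1], [0.85, 0.2, 0.4]]
selected = prune_context(occupancy, relevance, keep_polylines=2, keep_agents=1)
print(selected)
```

In this toy input, each mode keeps a different subset of polylines and agents, which is the sense in which the pruning is modality-dependent: the decoder for a given mode would attend only to the elements selected for that mode.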

Paper Structure

This paper contains 28 sections, 11 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: (a) illustrates the traditional predictor's context input, while (b) is our integrated approach jointly predicting behavioral intentions, trajectories, and vectorized occupancy. In our approach, the decoder stage is fed only with influential agents and relevant map elements.
  • Figure 2: An overview of the IMPACT framework. Both the Intention Predictor and the Vectorized Occupancy Predictor share the same context encoder with the Trajectory Decoder, and their outputs are used to prune irrelevant agents and map polylines. This selective mechanism ensures that only the most critical context is fed into the decoder for final trajectory prediction.
  • Figure 3: An overview of our decoder framework, featuring context-aware pruning via symmetric dual filters.
  • Figure 4: Visualization of joint (left) and marginal (right) prediction results.
  • Figure 5: Visualization of Predicted Multimodal Occupancy and Intention Labels. In the top two rows, black agents represent other agents, while in the bottom two rows, they indicate ignored agents. Ground-truth trajectories are included for validation of predicted behaviors.
  • ...and 1 more figure