Multi-modal Integrated Prediction and Decision-making with Adaptive Interaction Modality Explorations

Tong Li, Lu Zhang, Sikang Liu, Shaojie Shen

TL;DR

This work tackles prediction and planning for autonomous driving in dense, dynamic traffic by introducing MIND, a framework that jointly predicts scene-level futures and ego decisions using a transformer-based predictor and Gaussian mixture models, combined with Adaptive Interaction Modality Exploration (AIME) to build a scenario tree. AIME dynamically branches the scenario tree based on the variation of interaction uncertainty, then prunes and merges branches according to their interaction modalities to keep the tree compact. Contingency planning operates on the resulting scenario trees to produce trajectory trees that are optimized over multi-modal evolutions subject to safety constraints. Evaluations on the Argoverse 2 dataset demonstrate superior performance in both open-loop prediction and closed-loop driving simulation compared with strong baselines, indicating practical potential for reliable, interactive autonomous driving in complex environments.

Abstract

Navigating dense and dynamic environments poses a significant challenge for autonomous driving systems, owing to the intricate nature of multimodal interaction, wherein the actions of various traffic participants and the autonomous vehicle are complex and implicitly coupled. In this paper, we propose a novel framework, Multi-modal Integrated predictioN and Decision-making (MIND), which addresses the challenges by efficiently generating joint predictions and decisions covering multiple distinctive interaction modalities. Specifically, MIND leverages learning-based scenario predictions to obtain integrated predictions and decisions with social-consistent interaction modality and utilizes a modality-aware dynamic branching mechanism to generate scenario trees that efficiently capture the evolutions of distinctive interaction modalities with low variation of interaction uncertainty along the planning horizon. The scenario trees are seamlessly utilized by the contingency planning under interaction uncertainty to obtain clear and considerate maneuvers accounting for multi-modal evolutions. Comprehensive experimental results in the closed-loop simulation based on the real-world driving dataset showcase superior performance to other strong baselines under various driving contexts.

Paper Structure

This paper contains 27 sections, 13 equations, 10 figures, 3 tables, and 1 algorithm.

Figures (10)

  • Figure 1: A branch of the generated scenario tree and its corresponding topological structure. In MIND, we employ a learning-based scene-consistent driver model coupled with the adaptive interaction modality exploration (AIME) mechanism to efficiently construct the scenario tree. For each branch originating from the root node, we utilize contingency planning to generate a trajectory tree, accommodating multi-modal future evolutions.
  • Figure 2: Illustration of the components and the workflow in the MIND framework.
  • Figure 3: Illustration of one AIME-guided branching. The nodes of the scenario tree contain the states of the ego vehicle and agents. Firstly, the scenario tree is extended on the branching node leveraging the scenario prediction network. Then, the extended scenario tree is simplified by the pruning and merging process according to the interaction modality analysis. Finally, end nodes and branching nodes are determined during the adaptive branching, which triggers the next AIME process if branching nodes exist.
  • Figure 4: Architecture of the scene prediction network. After the feature encoding, we mix the scene-level mode queries with agent features. As illustrated, $A=3$ agents are included in this scene, while $K=6$ mode queries are injected. At last, the scene decoder generates $K$ possible joint future scenes with estimated probabilities.
  • Figure 5: Intersection scenarios selected from the Argoverse 2 validation split for comparison. The ego vehicle is colored in blue. Trajectories are drawn in fading purple, and vehicles at key frames are drawn in fading colors. Scen. I: Three vehicles meet at the intersection. The vehicle from the top right exhibits a misleading intention of left turning, while the vehicle from the top left can easily be misidentified as taking the right-of-way. Scen. II: Two vehicles enter the intersection. The yielding intention of the bottom-left vehicle needs to be identified to determine the passing priority. Scen. III: Two vehicles approach the intersection. The intention of the vehicle from the top right can easily be misidentified as going straight, which would influence the behavior of the ego vehicle during the unprotected left turn.
  • ...and 5 more figures
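
The AIME steps described in the Figure 3 caption (extend, prune and merge, then decide where to branch next) can be illustrated with a small sketch. This is not the authors' implementation: the `Scene` type, the distance-based merge test, the thresholds `min_prob`, `merge_dist`, and `max_growth`, and the scalar per-step uncertainty are all hypothetical stand-ins chosen for illustration. The paper's actual interaction modality analysis and branching criterion are not given in this summary.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Scene:
    """One predicted joint future: a probability and one trajectory per agent."""
    prob: float
    trajs: List[List[Point]]  # trajs[agent][timestep] -> (x, y)


def scene_distance(a: Scene, b: Scene) -> float:
    """Largest pointwise Euclidean gap between two scenes.

    Hypothetical proxy for "same interaction modality"; assumed, not from the paper.
    """
    return max(
        (((pa[0] - pb[0]) ** 2 + (pa[1] - pb[1]) ** 2) ** 0.5
         for ta, tb in zip(a.trajs, b.trajs)
        for pa, pb in zip(ta, tb)),
        default=0.0,
    )


def prune_and_merge(scenes: List[Scene],
                    min_prob: float = 0.05,
                    merge_dist: float = 1.0) -> List[Scene]:
    """Drop unlikely scenes, fold near-duplicates together, renormalize."""
    # Prune: discard scenes below the (assumed) probability floor.
    kept = sorted((s for s in scenes if s.prob >= min_prob),
                  key=lambda s: s.prob, reverse=True)
    merged: List[Scene] = []
    for s in kept:
        for m in merged:
            if scene_distance(s, m) <= merge_dist:
                m.prob += s.prob  # Merge: the likelier scene absorbs the mass.
                break
        else:
            merged.append(Scene(s.prob, s.trajs))  # copy, inputs stay untouched
    total = sum(m.prob for m in merged)
    if total <= 0.0:
        return merged
    return [Scene(m.prob / total, m.trajs) for m in merged]


def first_branch_step(uncertainty: List[float],
                      max_growth: float = 1.0) -> Optional[int]:
    """First timestep where uncertainty has grown by more than max_growth
    relative to step 0, or None if it never does (assumed criterion)."""
    for t, u in enumerate(uncertainty):
        if u - uncertainty[0] > max_growth:
            return t
    return None


# Toy usage: four single-agent scenes, two of which nearly coincide.
demo = prune_and_merge([
    Scene(0.50, [[(0.0, 0.0), (1.0, 0.0)]]),
    Scene(0.30, [[(0.0, 0.0), (1.0, 0.5)]]),
    Scene(0.18, [[(0.0, 0.0), (1.0, 5.0)]]),
    Scene(0.02, [[(0.0, 0.0), (9.0, 9.0)]]),
])
print(len(demo), [round(s.prob, 3) for s in demo])
print(first_branch_step([0.1, 0.3, 0.9, 1.6, 2.4]))
```

In this toy setup, a branch is kept as a single path while its uncertainty stays within `max_growth` of its starting value and is split only once that bound is exceeded. This loosely mirrors the stated goal of keeping the variation of interaction uncertainty low along each branch.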