CogDrive: Cognition-Driven Multimodal Prediction-Planning Fusion for Safe Autonomy

Heye Huang, Yibin Yang, Mingfeng Fan, Haoran Wang, Xiaocong Zhao, Jianqiang Wang

TL;DR

CogDrive presents a cognition-driven framework that unifies multimodal trajectory prediction with safety-stabilized planning for safe autonomy in complex traffic. It introduces modality-aware prediction using topological motion semantics and a symmetric relational encoder, paired with a DETR-style decoder to produce diverse, interpretable mode hypotheses. The planning module implements a multimodal preparedness strategy and a single-vehicle optimizer with dynamic constraint planes and robust safety corridors, backed by a local QP formulation. Evaluations on Argoverse 2 and INTERACTION show strong prediction accuracy, low miss rates, and stable, adaptive behavior in congested and interactive scenarios. The work demonstrates how coupling cognitive prediction with safety-focused planning can achieve reliable, interpretable autonomous driving under multimodal uncertainty.
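As a rough illustration of how a constraint plane and a local QP fit together, the sketch below treats the simplest possible case: one ego waypoint, one predicted obstacle position, and one half-space constraint. In that case the problem "minimize the squared distance to the nominal waypoint subject to n · p >= b" has a closed-form solution, namely a projection onto the plane. The function names, the `margin` parameter, and the choice of a normal pointing from the obstacle toward the ego are all assumptions made for illustration; this is not CogDrive's actual formulation, which is not reproduced on this page.

```python
import math


def constraint_plane(obstacle, ego, margin):
    """Build a half-space n . p >= b that keeps the ego at least `margin`
    away from the obstacle along the obstacle-to-ego direction.
    (Hypothetical helper; the normal direction is an assumed design choice.)"""
    dx, dy = ego[0] - obstacle[0], ego[1] - obstacle[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        raise ValueError("ego and obstacle coincide; normal is undefined")
    normal = (dx / dist, dy / dist)
    offset = normal[0] * obstacle[0] + normal[1] * obstacle[1] + margin
    return normal, offset


def project_onto_halfspace(p_nom, normal, offset):
    """Solve min ||p - p_nom||^2 subject to normal . p >= offset.
    With a single linear constraint the QP reduces to a projection."""
    nx, ny = normal
    norm_sq = nx * nx + ny * ny
    if norm_sq == 0.0:
        raise ValueError("normal must be non-zero")
    violation = offset - (nx * p_nom[0] + ny * p_nom[1])
    if violation <= 0.0:
        # Nominal waypoint already satisfies the constraint.
        return (p_nom[0], p_nom[1])
    step = violation / norm_sq
    return (p_nom[0] + step * nx, p_nom[1] + step * ny)


if __name__ == "__main__":
    normal, offset = constraint_plane(obstacle=(0.0, 0.0), ego=(1.0, 0.0), margin=2.0)
    print(project_onto_halfspace((1.0, 0.0), normal, offset))
```

Rebuilding the plane at every time step from the predicted obstacle position is what would make such a constraint "dynamic". A full planner would stack many of these constraints over the horizon and add dynamics and smoothness terms, which requires a numerical QP solver rather than a closed form.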

Abstract

Safe autonomous driving in mixed traffic requires a unified understanding of multimodal interactions and dynamic planning under uncertainty. Existing learning-based approaches struggle to capture rare but safety-critical behaviors, while rule-based systems often lack adaptability in complex interactions. To address these limitations, CogDrive introduces a cognition-driven multimodal prediction and planning framework that integrates explicit modal reasoning with safety-aware trajectory optimization. The prediction module adopts cognitive representations of interaction modes based on topological motion semantics and nearest-neighbor relational encoding. With a differentiable modal loss and multimodal Gaussian decoding, CogDrive learns sparse and unbalanced interaction behaviors and improves long-horizon trajectory prediction. The planning module incorporates an emergency-response concept and optimizes safety-stabilized trajectories, where short-term consistent branches ensure safety during replanning cycles and long-term branches support smooth and collision-free motion under low-probability switching modes. Experiments on the Argoverse 2 and INTERACTION datasets show that CogDrive achieves strong performance in trajectory accuracy and miss rate, while closed-loop simulations confirm adaptive behavior in merge and intersection scenarios. By combining cognitive multimodal prediction with safety-oriented planning, CogDrive offers an interpretable and reliable paradigm for safe autonomy in complex traffic.
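For readers unfamiliar with how trajectory accuracy and miss rate are usually scored for multimodal predictors, the following is a minimal sketch of the common minADE, minFDE, and miss-rate definitions. It assumes each scenario provides K candidate trajectories and one ground-truth trajectory as lists of (x, y) points, and it assumes a 2.0 m endpoint threshold for a miss. The function names and the threshold are illustrative assumptions; the exact evaluation protocol used in the paper is not stated on this page.

```python
import math


def displacement_errors(pred, gt):
    """Per-timestep Euclidean distance between one predicted and one
    ground-truth trajectory of equal length."""
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]


def min_ade_fde(modes, gt):
    """Best-of-K average and final displacement errors.
    minADE and minFDE are each minimized over modes independently."""
    ades, fdes = [], []
    for mode in modes:
        errs = displacement_errors(mode, gt)
        ades.append(sum(errs) / len(errs))
        fdes.append(errs[-1])
    return min(ades), min(fdes)


def miss_rate(scenarios, threshold=2.0):
    """Fraction of scenarios whose best endpoint error exceeds `threshold`
    (in metres; the 2.0 default is an assumed convention)."""
    misses = sum(1 for modes, gt in scenarios if min_ade_fde(modes, gt)[1] > threshold)
    return misses / len(scenarios)


if __name__ == "__main__":
    gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    good = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]]
    bad = [[(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)]]
    print(min_ade_fde(good, gt))
    print(miss_rate([(good, gt), (bad, gt)]))
```

Because the minimum is taken over modes, a predictor is rewarded for covering the true behavior with at least one hypothesis, which is why these metrics suit multimodal models.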

Paper Structure

This paper contains 18 sections, 28 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the cognition-driven multimodal prediction network in CogDrive. Historical trajectories, high-definition maps, and local coordinate information are encoded through three MLP-based embedding networks. Their outputs are fused by a symmetric fusion encoder that models pairwise spatial and behavioral relations via relative positional and nearest-neighbor encoding. Learnable query decoding with multi-branch cross-attention generates multimodal joint trajectories, each representing a distinct interaction mode between the ego and surrounding agents.
  • Figure 2: Architecture of the symmetric fusion encoder. Its structure resembles a self-attention model but explicitly introduces relational encoding between different instance-centric coordinate systems. Through symmetric feature updates and relative positional embeddings, the encoder preserves viewpoint and ordering invariance across instances. Each instance is represented as a node, and their pairwise coordinate transformations define directed edges, forming a fully connected self-looped graph that ensures consistent bidirectional fusion of multimodal features.
  • Figure 3: Decoder architecture for interaction-aware representation learning. The decoder receives relational features from the symmetric fusion encoder and generates multimodal hypotheses through iterative self- and cross-attention. Each learnable query combines an anchor component and a modality-guided component derived from the ego and its nearest neighbors, enabling interpretable reasoning over distinct behavioral modes. The decoder progressively refines these queries into trajectory hypotheses with associated probabilities, forming a complete mapping from interaction context to multimodal motion prediction.
  • Figure 4: Dynamic safety-aware trajectory planning in CogDrive. Cognitive prediction provides a mode-weighted nominal trajectory, from which dynamic constraint planes and an adaptive safety boundary are constructed. Through local QP updates, the planner yields collision-free solutions that realize either cooperative yielding or complete avoidance, aligning the ego motion with multimodal interaction intentions.
  • Figure 5: Representative multimodal trajectory prediction across diverse driving scenarios. The red vehicle represents the ego agent, and colored lines indicate predicted trajectories under different interaction modes. The examples span intersections and roundabouts with varying traffic densities and driving behaviors. Dots indicate past motion, while solid and dashed lines show multimodal futures. CogDrive differentiates behavioral modes, preserves trajectory smoothness, and maintains consistent prediction quality across diverse environments.
  • ...and 1 more figure
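The Figure 2 caption states that relational encoding between instance-centric coordinate systems preserves viewpoint invariance. One way to see why this can hold is sketched below: if each directed edge stores the pose of instance j expressed in the local frame of instance i, that edge feature does not change when the whole scene is rotated and translated. The pose layout `(x, y, heading)` and the function names are assumptions for illustration only, not the encoder's actual implementation.

```python
import math


def relative_pose(pose_i, pose_j):
    """Pose of instance j expressed in the local frame of instance i.
    Each pose is (x, y, heading) in a shared world frame."""
    xi, yi, hi = pose_i
    xj, yj, hj = pose_j
    dx, dy = xj - xi, yj - yi
    c, s = math.cos(hi), math.sin(hi)
    # Rotate the world-frame offset by -hi to enter i's frame.
    local_x = c * dx + s * dy
    local_y = -s * dx + c * dy
    # Wrap the heading difference into (-pi, pi].
    dh = math.atan2(math.sin(hj - hi), math.cos(hj - hi))
    return (local_x, local_y, dh)


def rigid_transform(pose, shift, angle):
    """Rotate a pose about the world origin by `angle`, then translate by `shift`."""
    x, y, h = pose
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1], h + angle)


if __name__ == "__main__":
    ego = (2.0, 1.0, 0.3)
    neighbor = (5.0, 4.0, 1.1)
    before = relative_pose(ego, neighbor)
    moved = [rigid_transform(p, shift=(10.0, -7.0), angle=0.8) for p in (ego, neighbor)]
    after = relative_pose(moved[0], moved[1])
    print(before)
    print(after)
```

The two printed tuples agree up to floating-point error, and swapping the argument order yields the reverse directed edge, which matches the caption's description of pairwise coordinate transformations defining directed edges.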