Conformal Trajectory Prediction with Multi-View Data Integration in Cooperative Driving

Xi Chen, Rahul Bhadani, Larry Head

TL;DR

V2INet presents an end-to-end framework for multi-view cooperative trajectory prediction by integrating ego-vehicle and infrastructure data through per-view graph encoders and a cross-graph attention fusion, followed by a multimodal Laplace-based decoder. A post-hoc conformal prediction module provides statistically valid and efficient uncertainty intervals for multimodal predictions, enabling safer decision-making in motion planning. Evaluated on the real-world V2X-Seq dataset, V2INet achieves strong predictive performance (notably in minFDE and MR) while delivering calibrated prediction intervals via CopulaCPTS, outperforming several baselines that rely on explicit multi-view association. The approach leverages pretrained single-view models, avoids complex pretraining for cross-view association, and offers a practical, scalable path toward uncertainty-aware cooperative driving.
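The two metrics named above can be illustrated with a short sketch. It assumes the common definitions: minFDE is the smallest final-position error over the K predicted modes, and MR is the fraction of samples whose minFDE exceeds a distance threshold. The function names, the toy coordinates, and the 2.0 m threshold are illustrative assumptions, not values taken from the paper or its code.

```python
import math


def min_fde(pred_endpoints, gt_endpoint):
    """Smallest final displacement error over K predicted modes.

    pred_endpoints: list of K (x, y) final positions, one per mode.
    gt_endpoint: (x, y) ground-truth final position.
    """
    return min(math.dist(p, gt_endpoint) for p in pred_endpoints)


def miss_rate(all_pred_endpoints, all_gt_endpoints, threshold=2.0):
    """Fraction of samples whose minFDE exceeds the threshold (metres).

    The 2.0 m default is an assumed convention, not a value from the text.
    """
    misses = sum(
        min_fde(preds, gt) > threshold
        for preds, gt in zip(all_pred_endpoints, all_gt_endpoints)
    )
    return misses / len(all_gt_endpoints)


# Two toy samples with K = 2 modes each (made-up numbers).
preds = [[(0.0, 0.0), (3.0, 4.0)], [(10.0, 0.0), (0.0, 5.0)]]
gts = [(3.0, 3.0), (0.0, 0.0)]
print(min_fde(preds[0], gts[0]))  # 1.0
print(miss_rate(preds, gts))      # 0.5
```

Because both metrics take the minimum over modes, a model is rewarded when any one of its predicted trajectories ends close to the ground truth, regardless of the score assigned to that mode.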

Abstract

Current research on trajectory prediction primarily relies on data collected by onboard sensors of an ego vehicle. With the rapid advancement in connected technologies, such as vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication, valuable information from alternative views becomes accessible via wireless networks. The integration of information from alternative views has the potential to overcome the inherent limitations associated with a single viewpoint, such as occlusions and limited field of view. In this work, we introduce V2INet, a novel trajectory prediction framework designed to model multi-view data by extending existing single-view models. Unlike previous approaches where the multi-view data is manually fused or formulated as a separate training stage, our model supports end-to-end training, enhancing both flexibility and performance. Moreover, the predicted multimodal trajectories are calibrated by a post-hoc conformal prediction module to obtain valid and efficient confidence regions. We evaluate the entire framework on the real-world V2I dataset V2X-Seq. Our results demonstrate superior performance in terms of Final Displacement Error (FDE) and Miss Rate (MR) using a single GPU. The code is publicly available at: https://github.com/xichennn/V2I_trajectory_prediction.
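To make the idea of post-hoc conformal calibration concrete, here is a minimal sketch of plain split conformal prediction for a single predicted position. It is a simpler stand-in for illustration only and is not the CopulaCPTS procedure used in the paper, which handles multi-step forecasts. The function names and calibration scores are hypothetical. Under the usual exchangeability assumption, a disc with the returned radius covers the true position with probability at least 1 - alpha.

```python
import math


def conformal_radius(cal_scores, alpha=0.1):
    """Split-conformal quantile of calibration nonconformity scores.

    cal_scores: distances between predicted and true positions on a
    held-out calibration set (hypothetical inputs for illustration).
    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or
    infinity when the calibration set is too small for that level.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(cal_scores)[rank - 1]


def covers(pred, truth, radius):
    """True if the true position lies inside the disc around the prediction."""
    return math.dist(pred, truth) <= radius


# Made-up calibration errors in metres.
cal_scores = [0.2, 0.5, 0.4, 1.1, 0.9, 0.3, 0.7, 0.6, 1.5]
radius = conformal_radius(cal_scores, alpha=0.2)
print(radius)  # 1.1
```

The calibration step only needs the trained model's errors on held-out data, which is why it can be attached after training without changing the predictor.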

Paper Structure (25 sections, 14 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Motivational scenarios. The AV is shown in orange. (a) The AV is attempting a left turn and is at risk of a potential collision with an oncoming vehicle going straight. (b) The AV's onboard sensors have their field of view obstructed by large trucks or other vehicles, limiting their ability to detect the oncoming traffic. (c) The roadside cameras' view of the intersection. They are positioned to have an unobstructed view of the entire intersection, providing a complete picture of the traffic situation.
  • Figure 2: Inaccurate inference scores. The ground truth trajectory is represented in red, while predictions with lower scores are depicted in lighter shades of blue.
  • Figure 3: Proposed model architecture. Data collected from the vehicle view are represented in red, while data from the infrastructure view are depicted in blue. The model takes as input the graph data constructed from both views. We apply single-view encoders to encode information from each view, followed by the fusion of the two embeddings through a cross-graph attention module. The final embedding passes through a multimodal decoder, providing multimodal predictions for all the agents of interest.
  • Figure 4: Qualitative results. The ground truth (in red) and predicted multimodal trajectories (in different shades of blue) of the target agent are shown. Darker blue represents higher probability. Yellow and grey rectangles denote road agents observed from the vehicle view and the infrastructure view, respectively.
  • Figure 5: Qualitative results comparison for three models: HiVT-Ego (first row), HiVT-PPVIC (second row), and ours (third row). The ground truth (in red) and predicted multimodal trajectories (in different shades of blue) of the target agent are shown. Darker blue represents higher probability. Yellow and grey rectangles denote road agents observed from the vehicle view and the infrastructure view, respectively.
  • ...and 1 more figures