Co-MTP: A Cooperative Trajectory Prediction Framework with Multi-Temporal Fusion for Autonomous Driving

Xinyu Zhang, Zewei Zhou, Zhaoyi Wang, Yangjie Ji, Yanjun Huang, Hong Chen

TL;DR

Co-MTP tackles planning-aware prediction under occlusion by fusing temporal cues from multiple V2X sources through a heterogeneous graph Transformer that operates in both history and future domains. The method introduces Cross-Temporal and Cross-Agent fusion and a multimodal decoder to generate multiple trajectory modes, guided by a planning-aware objective. Experiments on V2X-Seq show state-of-the-art performance and demonstrate the benefits of incorporating both history and infrastructure-predicted future information, as well as robustness to noise and delays. This framework enables more reliable, infrastructure-assisted planning in autonomous driving by leveraging temporal context across agents. The approach advances cooperative prediction by integrating multi-view, multi-time information with explicit planning considerations.

Abstract

Vehicle-to-everything (V2X) technologies have become an ideal paradigm for extending the perception range and seeing through occlusion. Existing efforts focus on single-frame cooperative perception; however, how to capture temporal cues between frames with V2X to facilitate the prediction task, and even the planning task, remains underexplored. In this paper, we introduce Co-MTP, a general cooperative trajectory prediction framework with multi-temporal fusion for autonomous driving, which leverages the V2X system to fully capture the interaction among agents in both the history and future domains to benefit planning. In the history domain, V2X can complement the incomplete history trajectories of single-vehicle perception, and we design a heterogeneous graph transformer to learn the fusion of history features from multiple agents and capture the history interaction. Moreover, the goal of prediction is to support future planning. Thus, in the future domain, V2X can provide the prediction results of surrounding objects, and we further extend the graph transformer to capture the future interaction between the ego planning and the other vehicles' intentions and obtain the final future scenario state under a given planning action. We evaluate the Co-MTP framework on the real-world dataset V2X-Seq, and the results show that Co-MTP achieves state-of-the-art performance and that both history and future fusion can greatly benefit prediction.

Paper Structure

This paper contains 15 sections, 9 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The overall architecture of Co-MTP. In this framework, the infrastructure shares its history and prediction results with the ego AV. We then construct a heterogeneous scene graph from the processed trajectory data and map information, categorizing nodes by the types of objects and map elements. Next, we initialize the features of nodes and edges in the relative coordinate system of each object. The CTCA Fusion updates the features of the nodes and edges selected by the STSA module over K Transformer layers. Finally, we take the nodes' hidden features from the last layer and feed them into the Multimodal Decoder to obtain the multimodal trajectory prediction results.
  • Figure 2: Illustration of STFA in the heterogeneous graph. In addition to the object nodes with preprocessed data from the AV's view, objects with raw data from the infrastructure's view also participate as independent nodes. To model future interaction, we treat the AV's planning and the infrastructure's prediction results as independent nodes, establishing edge relationships with historical nodes in the graph.
  • Figure 3: Qualitative examples of Co-MTP on the V2X-Seq dataset. The red box is the AV, the orange boxes are the prediction targets, and the blue boxes are other objects. Predicted trajectories are shown in green, history ground truth in blue, and future ground truth in brown.
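
The Figure 1 caption mentions initializing node features "in the relative coordinate system of each object", and Figure 2 describes history and future trajectories from different views entering the graph as independent nodes. A minimal sketch of that idea follows. It is not the authors' implementation: the `Node` fields, the label strings, and the choice of reference pose are assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


def to_local_frame(points: List[Point], origin: Point, heading: float) -> List[Point]:
    """Express global (x, y) points in a frame centered at `origin`
    whose x-axis points along `heading` (radians)."""
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    local = []
    for x, y in points:
        dx, dy = x - origin[0], y - origin[1]
        # Rotate the offset by -heading so "straight ahead" becomes +x.
        local.append((cos_h * dx + sin_h * dy, -sin_h * dx + cos_h * dy))
    return local


@dataclass
class Node:
    """Hypothetical graph node; field names are illustrative only."""
    node_type: str          # e.g. "vehicle", "pedestrian", "lane"
    source: str             # e.g. "av" or "infrastructure"
    domain: str             # "history" or "future"
    local_traj: List[Point]


def make_node(node_type: str, source: str, domain: str,
              traj: List[Point], origin: Point, heading: float) -> Node:
    """Build a node whose trajectory is stored relative to its own reference pose."""
    return Node(node_type, source, domain, to_local_frame(traj, origin, heading))


# A vehicle seen by the infrastructure, heading along +y, anchored at its latest position.
node = make_node("vehicle", "infrastructure", "history",
                 traj=[(1.0, 1.0), (1.0, 3.0)], origin=(1.0, 3.0), heading=math.pi / 2)
```

Storing each trajectory relative to its own pose keeps node features invariant to where the scene sits in global coordinates; the relative pose between two nodes would then be carried by edge features, which this sketch omits.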