Co-MTP: A Cooperative Trajectory Prediction Framework with Multi-Temporal Fusion for Autonomous Driving
Xinyu Zhang, Zewei Zhou, Zhaoyi Wang, Yangjie Ji, Yanjun Huang, Hong Chen
TL;DR
Co-MTP tackles planning-aware prediction under occlusion by fusing temporal cues from multiple V2X sources through a heterogeneous graph Transformer that operates in both the history and future domains. The method introduces Cross-Temporal and Cross-Agent fusion and a multimodal decoder that generates multiple trajectory modes, guided by a planning-aware objective. Experiments on V2X-Seq show state-of-the-art performance, demonstrate the benefits of incorporating both history and infrastructure-predicted future information, and indicate robustness to noise and delays. By integrating multi-view, multi-time information with explicit planning considerations, the framework supports more reliable, infrastructure-assisted planning in autonomous driving.
Abstract
Vehicle-to-everything (V2X) technologies have become an ideal paradigm for extending the perception range and seeing through occlusions. Existing efforts focus on single-frame cooperative perception; however, how to capture temporal cues between frames with V2X to facilitate prediction, and even planning, remains underexplored. In this paper, we introduce Co-MTP, a general cooperative trajectory prediction framework with multi-temporal fusion for autonomous driving, which leverages the V2X system to fully capture the interaction among agents in both the history and future domains to benefit planning. In the history domain, V2X can complement the incomplete history trajectories of single-vehicle perception, and we design a heterogeneous graph transformer to learn the fusion of history features from multiple agents and capture the history interaction. Moreover, the goal of prediction is to support future planning. Thus, in the future domain, V2X can provide the prediction results of surrounding objects, and we further extend the graph transformer to capture the future interaction between the ego planning and the other vehicles' intentions and to obtain the final future scenario state under a given planning action. We evaluate the Co-MTP framework on the real-world dataset V2X-Seq, and the results show that Co-MTP achieves state-of-the-art performance and that both history and future fusion can greatly benefit prediction.
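
As a rough illustration of the history-domain idea, and not the authors' architecture, the sketch below fuses one agent's per-timestep history features from two sources (the ego vehicle and the infrastructure) using masked single-head attention with source-specific projections. The function name `fuse_history`, the parameter names, the single learned query, and the feature shapes are all assumptions made for illustration; the heterogeneous graph transformer described above is more elaborate.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array (handles -inf entries)."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / np.sum(e)


def fuse_history(ego_feat, infra_feat, ego_valid, infra_valid, params):
    """Toy fusion of one agent's history features from two V2X sources.

    ego_feat, infra_feat: (T, D) arrays of per-timestep features. Occluded
        steps must hold finite placeholders (e.g. zeros), not NaN.
    ego_valid, infra_valid: (T,) boolean masks, False where the source
        did not observe the agent.
    params: dict of hypothetical weights: "Wk_ego", "Wk_infra", "Wv_ego",
        "Wv_infra" with shape (D, D), and "query" with shape (D,).
    Returns the fused (D,) feature and the (2T,) attention weights.
    """
    # Source-specific key/value projections: the "heterogeneous" part of
    # this toy example (an assumption, not the paper's exact design).
    keys = np.concatenate([ego_feat @ params["Wk_ego"],
                           infra_feat @ params["Wk_infra"]])
    values = np.concatenate([ego_feat @ params["Wv_ego"],
                             infra_feat @ params["Wv_infra"]])
    valid = np.concatenate([ego_valid, infra_valid])

    # If neither source observed the agent, there is nothing to fuse.
    if not valid.any():
        return np.zeros(values.shape[1]), np.zeros(valid.shape[0])

    # Scaled dot-product scores; unobserved steps are masked out so they
    # receive exactly zero attention weight.
    scores = keys @ params["query"] / np.sqrt(keys.shape[1])
    scores = np.where(valid, scores, -np.inf)
    weights = softmax(scores)
    return weights @ values, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D = 6, 4
    params = {name: rng.normal(size=(D, D))
              for name in ("Wk_ego", "Wk_infra", "Wv_ego", "Wv_infra")}
    params["query"] = rng.normal(size=D)

    ego = rng.normal(size=(T, D))
    infra = rng.normal(size=(T, D))
    # The ego vehicle loses sight of the agent for two steps (occlusion);
    # the infrastructure view covers the whole window.
    ego_valid = np.array([True, True, False, False, True, True])
    infra_valid = np.ones(T, dtype=bool)

    fused, weights = fuse_history(ego, infra, ego_valid, infra_valid, params)
    print(fused.shape, weights.shape)
```

In this toy setting the infrastructure view fills the steps that the ego vehicle missed, because the masked steps contribute nothing to the fused feature. In a learned model the projections and query would be trained end to end rather than sampled at random.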
