V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models
Junwei You, Haotian Shi, Zhuoyu Jiang, Zilin Huang, Rui Gan, Keshu Wu, Xi Cheng, Xiaopeng Li, Bin Ran
TL;DR
The paper tackles robust end-to-end cooperative autonomous driving by fusing multiperspective vehicle and infrastructure camera data with textual scene descriptions through a large vision-language model. V2X-VLM uses contrastive feature alignment to strengthen cross-modal understanding and a teacher–student knowledge distillation scheme to stabilize training, achieving state-of-the-art trajectory planning on the DAIR-V2X dataset. Key findings include lower L2 error and near-zero collision rates than existing baselines, a detailed analysis of how downsampling trades transmission cost against accuracy, and robustness to input perturbations, all of which support real-time deployment. The work advances practical cooperative driving by integrating rich semantic context into planning while addressing communication constraints and training stability.
Abstract
Vehicle-to-everything (V2X) cooperation has emerged as a promising paradigm to overcome the perception limitations of classical autonomous driving by leveraging information from both ego-vehicle and infrastructure sensors. However, effectively fusing heterogeneous visual and semantic information while ensuring robust trajectory planning remains a significant challenge. This paper introduces V2X-VLM, a novel end-to-end (E2E) cooperative autonomous driving framework based on vision-language models (VLMs). V2X-VLM integrates multiperspective camera views from vehicles and infrastructure with text-based scene descriptions to enable a more comprehensive understanding of driving environments. Specifically, we propose a contrastive learning-based mechanism that reinforces the alignment of heterogeneous visual and textual features, enhancing the semantic understanding of complex driving scenarios, and we employ a knowledge distillation strategy to stabilize training. Experiments on a large real-world dataset demonstrate that V2X-VLM achieves state-of-the-art trajectory planning accuracy, significantly reducing L2 error and collision rate compared to existing cooperative autonomous driving baselines. Ablation studies validate the contribution of each component. Moreover, robustness and efficiency evaluations highlight the practicality of V2X-VLM for real-world deployment, where it can enhance overall autonomous driving safety and decision-making.
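The two training signals named in the abstract, contrastive alignment of visual and textual features and teacher-student knowledge distillation, can be illustrated with a minimal sketch. This is not the authors' implementation: the symmetric InfoNCE form of the alignment loss, the KL-divergence form of the distillation loss, the temperature values, and all function names are assumptions chosen for illustration, and the abstract does not specify how these terms are weighted against the trajectory loss.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def contrastive_alignment_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE loss (assumed form).

    The i-th image embedding and the i-th text embedding are treated as the
    positive pair; every other pairing in the batch is a negative.
    """
    img = [l2_normalize(v) for v in img_emb]
    txt = [l2_normalize(v) for v in txt_emb]
    n = len(img)
    # Cosine similarity matrix, scaled by the temperature tau.
    sim = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / tau for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: softmax over each row.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image direction: softmax over each column.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (i2t + t2i)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions (assumed form).

    The result is averaged over the batch and scaled by temperature**2, a
    common convention that keeps gradient magnitudes comparable.
    """
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p_s = log_softmax([x / temperature for x in s_row])
        log_p_t = log_softmax([x / temperature for x in t_row])
        total += sum(
            math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s)
        )
    return (temperature ** 2) * total / len(student_logits)
```

With matched image and text embeddings the alignment loss is close to zero, and it grows when the pairs are shuffled; the distillation loss is zero when student and teacher logits coincide. In a full training loop these terms would presumably be added to the trajectory planning loss with tunable weights, which are not given in this section.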
