V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models

Junwei You, Haotian Shi, Zhuoyu Jiang, Zilin Huang, Rui Gan, Keshu Wu, Xi Cheng, Xiaopeng Li, Bin Ran

TL;DR

The paper tackles the challenge of robust, end-to-end cooperative autonomous driving by fusing multiperspective vehicle and infrastructure data with textual scene descriptions through a large vision-language model. V2X-VLM employs contrastive feature alignment and a teacher–student knowledge distillation scheme to stabilize training and enhance cross-modal understanding, achieving state-of-the-art trajectory planning on the DAIR-V2X dataset. Key findings include lower L2 error and near-zero collision rates, a detailed analysis of the trade-off between transmission cost and image downsampling, and demonstrated robustness to input perturbations, supporting real-time deployment. The work advances practical cooperative driving by integrating rich semantic context into planning while addressing communication constraints and training stability.

Abstract

Vehicle-to-everything (V2X) cooperation has emerged as a promising paradigm to overcome the perception limitations of classical autonomous driving by leveraging information from both ego-vehicle and infrastructure sensors. However, effectively fusing heterogeneous visual and semantic information while ensuring robust trajectory planning remains a significant challenge. This paper introduces V2X-VLM, a novel end-to-end (E2E) cooperative autonomous driving framework based on vision-language models (VLMs). V2X-VLM integrates multiperspective camera views from vehicles and infrastructure with text-based scene descriptions to enable a more comprehensive understanding of driving environments. Specifically, we propose a contrastive learning-based mechanism to reinforce the alignment of heterogeneous visual and textual characteristics, which enhances the semantic understanding of complex driving scenarios, and employ a knowledge distillation strategy to stabilize training. Experiments on a large real-world dataset demonstrate that V2X-VLM achieves state-of-the-art trajectory planning accuracy, significantly reducing L2 error and collision rate compared to existing cooperative autonomous driving baselines. Ablation studies validate the contributions of each component. Moreover, the evaluation of robustness and efficiency highlights the practicality of V2X-VLM for real-world deployment to enhance overall autonomous driving safety and decision-making.
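The two training signals named in the abstract, contrastive image-text alignment and knowledge distillation, can be illustrated with a minimal NumPy sketch. The exact formulations used by V2X-VLM are not given in this section, so everything below is an assumption: the alignment term is written as a symmetric InfoNCE-style loss over a batch of paired embeddings, the distillation term as a temperature-softened KL divergence from teacher to student, and all function names, shapes, and temperature values are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    # Assumed InfoNCE-style loss: row i of image_emb and row i of text_emb
    # are the positive pair; every other row in the batch is a negative.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature            # shape (B, B)
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Assumed KL(teacher || student) on temperature-softened distributions,
    # scaled by temperature**2 as is common in distillation setups.
    t_logp = log_softmax(teacher_logits / temperature)
    s_logp = log_softmax(student_logits / temperature)
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1).mean()
    return kl * temperature ** 2

# Toy data: 4 image embeddings and near-identical "matching" text embeddings.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))
aligned = contrastive_alignment_loss(img, txt)
mismatched = contrastive_alignment_loss(img, txt[::-1])  # pairs deliberately broken
print(aligned < mismatched)
```

In this toy setup the loss is lower when each image is paired with its own description than when the pairing is reversed, which is the behavior an alignment objective of this kind is meant to reward. How the two terms are weighted against the trajectory planning loss is not stated here and is left out of the sketch.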

Paper Structure

This paper contains 40 sections, 22 equations, 7 figures, 5 tables, and 2 algorithms.

Figures (7)

  • Figure 1: Overview of end-to-end autonomous driving pipelines. (a) A cooperative driving scenario where infrastructure sensors supplement the ego vehicle’s limited field of view; (b.1) the classical end-to-end pipeline that relies solely on on-board sensor data; (b.2) a VLM-based end-to-end system that integrates multimodal reasoning within a single vehicle; (b.3) UniV2X, the pioneering end-to-end cooperative autonomous driving pipeline that fuses vehicle and infrastructure data; and (b.4) our proposed V2X-VLM framework, which leverages a large VLM to unify multimodal data for robust end-to-end trajectory planning.
  • Figure 2: Overview of the proposed V2X-VLM framework. Camera images from the vehicle and infrastructure sides, merged with a semantic text prompt, are fed into a VLM backbone for multiperspective and multimodal data fusion. Through comprehensive scene understanding and reasoning, V2X-VLM delivers accurate and reliable E2E trajectory planning. Contrastive learning-based feature alignment is applied during fine-tuning to ensure the effective fusion of visual and semantic features for enhanced scene understanding. Knowledge distillation stabilizes the learning process for the complex E2E autonomous driving task.
  • Figure 3: Visualization of V2X-VLM trajectory planning on three common driving scenarios. Continuous frames are visualized at a frequency of 1 Hz.
  • Figure 4: Visualization of V2X-VLM's trajectory planning for going-straight scenarios in challenging corner cases. Continuous frames are displayed at a frequency of 1 Hz.
  • Figure 5: Visualization of V2X-VLM's trajectory planning for right-turn scenarios in challenging corner cases. Continuous frames are displayed at a frequency of 1 Hz.
  • ...and 2 more figures