V2V-LLM: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multi-Modal Large Language Models

Hsu-kuang Chiu, Ryo Hachiuma, Chien-Yi Wang, Stephen F. Smith, Yu-Chiang Frank Wang, Min-Hung Chen

TL;DR

This work introduces V2V-QA, a dataset and benchmark for Vehicle-to-Vehicle cooperative autonomous driving using a centralized Multi-Modal Large Language Model (LLM). The proposed V2V-LLM fuses scene-level maps and object-level LiDAR features from multiple CAVs and answers grounding, notable object identification, and planning questions, enabling end-to-end cooperative perception and planning. Empirical results show V2V-LLM achieves strong performance on planning and notable object identification, with competitive grounding, while incurring modest communication overhead; ablations highlight the value of object-level features and pre-training. By releasing the V2V-QA dataset and code, the work opens a path toward unified, LLM-driven cooperative driving that can enhance safety in real-world deployments.

Abstract

Current autonomous driving vehicles rely mainly on their individual sensors to understand surrounding scenes and plan for future trajectories, which can be unreliable when the sensors are malfunctioning or occluded. To address this problem, cooperative perception methods via vehicle-to-vehicle (V2V) communication have been proposed, but they have tended to focus on perception tasks like detection or tracking. How those approaches contribute to overall cooperative planning performance is still under-explored. Inspired by recent progress using Large Language Models (LLMs) to build autonomous driving systems, we propose a novel problem setting that integrates a Multi-Modal LLM into cooperative autonomous driving, with the proposed Vehicle-to-Vehicle Question-Answering (V2V-QA) dataset and benchmark. We also propose our baseline method Vehicle-to-Vehicle Multi-Modal Large Language Model (V2V-LLM), which uses an LLM to fuse perception information from multiple connected autonomous vehicles (CAVs) and answer various types of driving-related questions: grounding, notable object identification, and planning. Experimental results show that our proposed V2V-LLM can be a promising unified model architecture for performing various tasks in cooperative autonomous driving, and outperforms other baseline methods that use different fusion approaches. Our work also creates a new research direction that can improve the safety of future autonomous driving systems. The code and data will be released to the public to facilitate open-source research in this field. Our project website: https://eddyhkchiu.github.io/v2vllm.github.io/ .

Paper Structure

This paper contains 31 sections, 32 figures, 15 tables.

Figures (32)

  • Figure 1: Overview of our problem setting of LLM-based cooperative autonomous driving. All CAVs share their perception information with the LLM. Any CAV can ask the LLM a question to obtain useful information for driving safety.
  • Figure 2: Illustration of V2V-QA's $5$ types of QA pairs. The arrows pointing at LLM indicate the perception data from CAVs.
  • Figure 3: Model diagram of our proposed V2V-LLM for cooperative autonomous driving.
  • Figure 4: Feature encoder diagrams of the baseline methods from different fusion approaches.
  • Figure 5: V2V-LLM's grounding results on V2V-QA's testing set. Magenta $\circ$: reference locations in questions. Yellow $+$: model output locations. Green $\circ$: ground-truth answers.
  • ...and 27 more figures