V2V-GoT: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multimodal Large Language Models and Graph-of-Thoughts

Hsu-kuang Chiu, Ryo Hachiuma, Chien-Yi Wang, Yu-Chiang Frank Wang, Min-Hung Chen, Stephen F. Smith

TL;DR

This work introduces V2V-GoT, a graph-of-thoughts framework that enables Multimodal Large Language Models to coordinate cooperative driving among connected autonomous vehicles. It traces occlusion-aware perception and planning-aware prediction through a nine-node QA graph, powered by a GoT-enabled MLLM built on LLaVA with temporal LiDAR features from multiple CAVs. The authors curate the V2V-GoT-QA dataset (based on V2V4Real) and demonstrate that V2V-GoT outperforms strong baselines in perception, prediction, and planning tasks, with ablations validating the value of occlusion-aware and planning-aware components. Communication costs remain comparable to prior V2V-LLM approaches, while task performance improves, suggesting practical viability for real-world cooperative driving. The work also provides open-source data and code to accelerate future research in GoT-enabled cooperative autonomous driving.

Abstract

Current state-of-the-art autonomous vehicles could face safety-critical situations when their local sensors are occluded by large nearby objects on the road. Vehicle-to-vehicle (V2V) cooperative autonomous driving has been proposed as a means of addressing this problem, and one recently introduced framework for cooperative autonomous driving has further adopted an approach that incorporates a Multimodal Large Language Model (MLLM) to integrate cooperative perception and planning processes. However, despite the potential benefit of applying graph-of-thoughts reasoning to the MLLM, this idea has not been considered by previous cooperative autonomous driving research. In this paper, we propose a novel graph-of-thoughts framework specifically designed for MLLM-based cooperative autonomous driving. Our graph-of-thoughts includes our proposed novel ideas of occlusion-aware perception and planning-aware prediction. We curate the V2V-GoT-QA dataset and develop the V2V-GoT model for training and testing the cooperative driving graph-of-thoughts. Our experimental results show that our method outperforms other baselines in cooperative perception, prediction, and planning tasks. Our project website: https://eddyhkchiu.github.io/v2vgot.github.io/ .

Paper Structure

This paper contains 24 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Illustration of our proposed graph-of-thoughts reasoning framework for cooperative autonomous driving. All Connected Autonomous Vehicles (CAVs) share their perception features with the Multimodal Large Language Model (MLLM), as illustrated by the grey arrows. Any CAV can ask the MLLM to provide a suggested future trajectory or answer perception or prediction questions. The MLLM fuses the perception features from all CAVs and performs inference by following the graph-of-thoughts. If two QA nodes are connected by a directed edge in the graph, as illustrated by the black arrows, the answer of the parent node QA is used as the input context of the child node QA. The other colored curved arrows illustrate the predicted or suggested future trajectories. Colored stars mark the current locations of objects and the predicted or suggested future waypoints.
  • Figure 2: Illustration of V2V-GoT-QA's $9$ types of QA pairs: Perception (Q1 - Q4), Prediction (Q5 - Q7), and Planning (Q8 - Q9). The black arrows pointing at the MLLM indicate the perception data from CAVs. Other colored arrows represent predicted or suggested future trajectories.
  • Figure 3: Model architecture of V2V-GoT.
  • Figure 4: Different graph-of-thoughts structures for cooperative autonomous driving. The QA types include Perception (Q1 - Q4), Prediction (Q5 - Q7), and Planning (Q8 - Q9). If two nodes are connected by a directed edge, the answer of the parent node QA is used as the input context of the child node QA.
  • Figure 5: Qualitative testing sample result of V2V-GoT on Q4 (Overall Notable Objects). The context information is from the testing inference output of the parent questions Q3 (Invisible Notable Objects) and Q1 (Visible Notable Objects). Magenta $\times$: current location of the asking CAV. Magenta curve: reference trajectory in the question. Yellow $\times$: model output. Green $\circ$: ground-truth answer.
  • ...and 2 more figures
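
The captions of Figures 1 and 4 describe the core inference rule: when two QA nodes are joined by a directed edge, the parent's answer becomes input context for the child. The sketch below illustrates that rule only and is not the authors' implementation. The function names (`run_graph`, `stub_mllm`) are hypothetical, the stub stands in for the MLLM, and the only edges taken from the source are Q1 → Q4 and Q3 → Q4 (Figure 5 caption). The Q4 → Q8 edge is an assumed placeholder.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List

# Maps each QA node to the list of its parent QA nodes.
# Q1 -> Q4 and Q3 -> Q4 come from the Figure 5 caption.
# The Q4 -> Q8 edge is a hypothetical placeholder, not the paper's graph.
PARENTS: Dict[str, List[str]] = {
    "Q1": [],
    "Q3": [],
    "Q4": ["Q1", "Q3"],
    "Q8": ["Q4"],  # assumed for illustration only
}

AskFn = Callable[[str, Dict[str, str]], str]


def run_graph(parents: Dict[str, List[str]], ask: AskFn) -> Dict[str, str]:
    """Answer every QA node, feeding parent answers to children as context."""
    answers: Dict[str, str] = {}
    # static_order() yields each node only after all of its parents.
    for node in TopologicalSorter(parents).static_order():
        context = {p: answers[p] for p in parents.get(node, [])}
        answers[node] = ask(node, context)
    return answers


def stub_mllm(question: str, context: Dict[str, str]) -> str:
    """Hypothetical stand-in for the MLLM: reports which context it received."""
    used = ",".join(sorted(context)) or "none"
    return f"answer({question}|ctx={used})"


if __name__ == "__main__":
    for q, a in run_graph(PARENTS, stub_mllm).items():
        print(q, "->", a)
```

Processing nodes in topological order guarantees that every parent answer exists before a child question is asked, which is the property the directed edges in the figures rely on.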