Tapping in a Remote Vehicle's onboard LLM to Complement the Ego Vehicle's Field-of-View

Malsha Ashani Mahawatta Dona, Beatriz Cabrero-Daniel, Yinan Yu, Christian Berger

TL;DR

The paper investigates using onboard large language models as a communication interface to enrich an ego vehicle's field-of-view in occluded traffic scenarios. It evaluates GPT-4V and GPT-4o on a Waymo-derived dataset to detect pedestrians and generate bounding boxes, revealing strong detection performance but unreliable localization due to high variability and environmental factors. Prompt engineering modestly improves recall for localization, but overall precision and IoU remain insufficient for robust object localization in real-time driving contexts. The work demonstrates the potential and current limitations of LLM-based inter-vehicle dialogue as a lightweight alternative to data streaming, highlighting the need for standardized messaging and further research into trustworthy, multi-agent perception systems.

Abstract

Today's advanced automotive systems are turning into intelligent Cyber-Physical Systems (CPS), bringing computational intelligence to their cyber-physical context. Such systems power advanced driver assistance systems (ADAS) that observe a vehicle's surroundings for their functionality. However, such ADAS have clear limitations in scenarios where the direct line-of-sight to surrounding objects is occluded, as in urban areas. Imagine now automated driving (AD) systems that could benefit from other vehicles' field-of-view in such occluded situations to increase traffic safety if, for example, the locations of pedestrians can be shared across vehicles. Current literature suggests vehicle-to-infrastructure (V2I) communication via roadside units (RSUs) or vehicle-to-vehicle (V2V) communication, which stream sensor or object data between vehicles, to address such issues. Given the ongoing revolution in vehicle system architectures towards powerful, centralized processing units with hardware accelerators, the onboard presence of large language models (LLMs) to improve passengers' comfort when using voice assistants is becoming a reality. We suggest and evaluate a concept to complement the ego vehicle's field-of-view (FOV) with another vehicle's FOV by tapping into its onboard LLM to let the machines have a dialogue about what the other vehicle "sees". Our results show that very recent versions of LLMs, such as GPT-4V and GPT-4o, understand a traffic situation to an impressive level of detail, and hence, they can even be used to spot traffic participants. However, better prompts are needed to improve the detection quality, and future work is needed towards a standardised message interchange format between vehicles.

Paper Structure

This paper contains 19 sections, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Graphical representation of a looking-around-the-corner problem: The ego vehicle is approaching an intersection together with vehicles A and B. Each vehicle has its own FOV, represented by its respective colour, and sharing that information could help the vehicles understand the complete traffic situation at hand.
  • Figure 2: Detailed answers obtained from Microsoft Copilot. The question "Do you see a pedestrian in this image?" was used as the prompt. The input image is taken from the Waymo dataset [waymoDataset] and shows a sunset scene.
  • Figure 3: Exploring an LLM's capabilities not only to describe but also to locate a pedestrian: (a) shows the input image (taken from [womancrossingroad_reference]), (b) the DALL·E 2 generated image, and the differences between both.
  • Figure 4: Intersection-over-Union (IoU) percentages for the images in which the GPT-4V generated and the ground truth bounding boxes overlap, zooming in on the overlapping IoUs. The pie chart contains 15 slices.
  • Figure 5: Images from the Waymo Dataset [DatasetReference] with a GPT-4V or GPT-4o generated bounding box (red) almost completely covering the ground truth area (green). The selected images show day and night scenarios. Results (a) and (b) are retrieved from the GPT-4V model, and (c) and (d) from the GPT-4o model.
  • ...and 2 more figures
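
The Intersection-over-Union (IoU) metric named in Figure 4 compares a generated bounding box with its ground truth box. The following is a minimal sketch of the standard IoU computation, not the authors' evaluation code. The `Box` type, the `(x_min, y_min, x_max, y_max)` pixel-corner layout, and the function names are assumptions made for illustration; the paper may represent boxes differently.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (assumed corner layout)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    width = max(0.0, box.x_max - box.x_min)
    height = max(0.0, box.y_max - box.y_min)
    return width * height


def iou(predicted: Box, ground_truth: Box) -> float:
    """Intersection-over-Union in [0, 1]; 0.0 when the boxes do not overlap."""
    # Width and height of the overlapping rectangle, clamped at zero.
    inter_w = max(0.0, min(predicted.x_max, ground_truth.x_max)
                  - max(predicted.x_min, ground_truth.x_min))
    inter_h = max(0.0, min(predicted.y_max, ground_truth.y_max)
                  - max(predicted.y_min, ground_truth.y_min))
    intersection = inter_w * inter_h
    union = area(predicted) + area(ground_truth) - intersection
    return intersection / union if union > 0.0 else 0.0


if __name__ == "__main__":
    # Hypothetical boxes, not taken from the paper's data.
    generated = Box(0, 0, 2, 2)
    truth = Box(1, 1, 3, 3)
    print(f"IoU = {iou(generated, truth):.1%}")
```

Note that a generated box that almost completely covers the ground truth area, as in Figure 5, can still have a low IoU if it is much larger than the ground truth box, because the union grows with the larger box.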