Multimodal Large Language Model Framework for Safe and Interpretable Grid-Integrated EVs

Jean Douglas Carvalho, Hugo Kenji, Ahmad Mohammad Saber, Glaucia Melo, Max Mauro Dias Santos, Deepa Kundur

TL;DR

This work presents a multimodal large language model framework that ingests object detections (YOLOv8), Cityscapes-based semantic segmentation, CAN bus telemetry, and geolocation to generate natural-language driver alerts for safe urban EV operation within smart grids. By converting multimodal sensor inputs into structured textual prompts, the approach enables interpretable reasoning and context-aware guidance, and it is validated on real-world urban data. The study compares text-only LLMs with a multimodal LLM, reporting feasible latency and strong agreement with human expert assessments across critical scenarios, including pedestrians, proximal vehicles, and complex intersections. The modular design supports scalable deployment and offers potential benefits for fleet coordination, EV load forecasting, and traffic-aware energy planning in smart-grid ecosystems.
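The conversion of sensor inputs into a structured textual prompt can be illustrated with a minimal sketch. It is not the authors' implementation: the `Detection` and `Telemetry` fields, the `build_prompt` helper, the confidence threshold, the per-object distance estimate, and the prompt wording are all hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Detection:
    """One detector output (hypothetical schema, not from the paper)."""
    label: str          # class name, e.g. "person" or "car"
    confidence: float   # detector score in [0, 1]
    distance_m: float   # assumed distance estimate to the ego vehicle


@dataclass
class Telemetry:
    """A small subset of CAN bus signals (hypothetical schema)."""
    speed_kmh: float
    brake_pressed: bool


def build_prompt(
    detections: Iterable[Detection],
    segments: Iterable[str],
    telemetry: Telemetry,
    street: str,
    min_conf: float = 0.5,  # assumed threshold, not taken from the paper
) -> str:
    """Serialize perception, telemetry, and location into one text prompt."""
    # Drop low-confidence detections and list the nearest objects first.
    kept: List[Detection] = sorted(
        (d for d in detections if d.confidence >= min_conf),
        key=lambda d: d.distance_m,
    )

    lines = [
        "You are a driving assistant. Write one short alert for the driver.",
        "",
        f"Location: {street}",
        f"Speed: {telemetry.speed_kmh:.0f} km/h",
        f"Brake pressed: {'yes' if telemetry.brake_pressed else 'no'}",
        "Scene elements: " + ", ".join(sorted(segments)),
        "Detected objects (nearest first):",
    ]
    for d in kept:
        lines.append(
            f"- {d.label}, about {d.distance_m:.0f} m away "
            f"(confidence {d.confidence:.2f})"
        )
    if not kept:
        lines.append("- none")
    return "\n".join(lines)
```

The resulting string would then be sent to whichever LLM is under evaluation. Keeping the prompt as plain, labeled text is what makes the model's input inspectable by a human reviewer.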

Abstract

The integration of electric vehicles (EVs) into smart grids presents unique opportunities to enhance both transportation systems and energy networks. However, ensuring safe and interpretable interactions between drivers, vehicles, and the surrounding environment remains a critical challenge. This paper presents a multimodal large language model (LLM) framework that processes sensor data, such as object detections, semantic segmentation, and vehicular telemetry, and generates natural-language alerts for drivers. The framework is validated on real-world data collected from instrumented vehicles driving on urban roads. By combining visual perception (YOLOv8), geocoded positioning, and CAN bus telemetry, the framework bridges raw sensor data and driver comprehension, enabling safer and more informed decision-making in urban driving scenarios. Case studies using real data demonstrate the framework's effectiveness in generating context-aware alerts for critical situations, such as proximity to pedestrians, cyclists, and other vehicles. This paper highlights the potential of LLMs as assistive tools in e-mobility, benefiting both transportation systems and electric networks by enabling scalable fleet coordination, EV load forecasting, and traffic-aware energy planning.

Index Terms: Electric vehicles, visual perception, large language models, YOLOv8, semantic segmentation, CAN bus, prompt engineering, smart grid.

Paper Structure

This paper contains 11 sections, 1 equation, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Raw frame of Scenario 1.
  • Figure 2: YOLOv8 detections in Scenario 1, showing pedestrians close to the ego vehicle and surrounding traffic.
  • Figure 3: Cityscapes segmentation of Scenario 1, with color-coded static and structural elements.
  • Figure 4: Scenario 3: Complex urban case in São Paulo.