Efficient Driving Behavior Narration and Reasoning on Edge Device Using Large Language Models

Yizhou Huang, Yihua Cheng, Kezhi Wang

TL;DR

This work integrates large language models with edge computing by deploying LLMs on roadside units (RSUs) connected via 5G to perform real-time narration and reasoning of driving behaviors. A three-stream multi-modal prompt strategy (environment, agent, motion) converts visual features into structured prompts, enabling memory-enhanced reasoning and localized descriptions while reducing data transmission. Evaluations on the OpenDV-Youtube dataset show that the multi-modal prompts significantly improve narration and reasoning accuracy across multiple LLMs, with Video-ChatGPT reaching up to 78.2% narration accuracy and 81.7% reasoning accuracy when prompted, and edge deployment delivering sub-second per-frame responses (0.3–0.5s for small frame counts). The framework also demonstrates substantial speed advantages over centralized baselines (e.g., ADAPT) and enables rapid inter-RSU alerts via 5G/C-V2X, highlighting practical benefits for real-time autonomous driving and safety-critical decision-making.

Abstract

Deep learning architectures with powerful reasoning capabilities have driven significant advancements in autonomous driving technology. Large language models (LLMs) applied in this field can describe driving scenes and behaviors with a level of accuracy similar to human perception, particularly in visual tasks. Meanwhile, the rapid development of edge computing, with its advantage of proximity to data sources, has made edge devices increasingly important in autonomous driving. Edge devices process data locally, reducing transmission delays and bandwidth usage, and achieving faster response times. In this work, we propose a driving behavior narration and reasoning framework that applies LLMs to edge devices. The framework consists of multiple roadside units, with LLMs deployed on each unit. These roadside units collect road data and communicate via 5G NR/NSA networks. Our experiments show that LLMs deployed on edge devices can achieve satisfactory response speeds. Additionally, we propose a prompt strategy to enhance the narration and reasoning performance of the system. This strategy integrates multi-modal information, including environmental, agent, and motion data. Experiments conducted on the OpenDV-Youtube dataset demonstrate that our approach significantly improves performance across both tasks.

Paper Structure (13 sections, 2 figures, 3 tables)

This paper contains 13 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: (a) Overall system framework. The LLM, deployed on RSUs acting as edge servers, receives input data from edge users via 5G NR/NSA communication technology. It analyzes the input data to generate the corresponding driving behavior narration and reasoning. Different RSUs can communicate with each other and share information. Finally, the generated textual descriptions are broadcast globally between edge devices. (b) Workflow of our framework deployed on an RSU. First, edge users collect surrounding road information and upload it to the RSU server using an IP address generated by our framework. The LLM then generates text-based outputs, which can be accessed through a real-time visualization window for backend queries.
  • Figure 2: Comparison between enabling and disabling the three-stream prompt. The LLM is able to generate descriptions of driving behavior based on the three streams. The observed results can trigger various keywords, such as the environment keyword "visibility," the agent keyword "pedestrian crossing," and the motion keyword "stop," among others.
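
To make the three-stream idea concrete, the following is a minimal sketch of how keywords from the environment, agent, and motion streams might be assembled into one structured text prompt. It is an illustration only, not the authors' implementation: the class `StreamObservations`, the function `build_three_stream_prompt`, the prompt wording, and the "none observed" fallback are all hypothetical, and only the three stream names and the example keywords come from the Figure 2 caption.

```python
from dataclasses import dataclass, field


@dataclass
class StreamObservations:
    """Keywords observed per stream (hypothetical container, not from the paper)."""
    environment: list[str] = field(default_factory=list)
    agent: list[str] = field(default_factory=list)
    motion: list[str] = field(default_factory=list)


def build_three_stream_prompt(obs: StreamObservations) -> str:
    """Assemble one structured text prompt from the three keyword streams.

    The layout and task wording are assumptions made for illustration.
    """
    def section(name: str, keywords: list[str]) -> str:
        # An empty stream is stated explicitly so the prompt layout stays fixed.
        body = ", ".join(keywords) if keywords else "none observed"
        return f"{name}: {body}"

    lines = [
        section("Environment", obs.environment),
        section("Agent", obs.agent),
        section("Motion", obs.motion),
        "Task: narrate the driving behavior and explain the reason for it.",
    ]
    return "\n".join(lines)


# Example keywords taken from the Figure 2 caption.
obs = StreamObservations(
    environment=["visibility"],
    agent=["pedestrian crossing"],
    motion=["stop"],
)
prompt = build_three_stream_prompt(obs)
print(prompt)
```

Keeping each stream on its own labeled line is one possible design choice: it lets the same template be reused when a stream has no detections, and it makes the text sent to the LLM easy to inspect on the RSU side.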