Multi-Agent AI Framework for Road Situation Detection and C-ITS Message Generation

Kailin Tong, Selim Solmaz, Kenan Mujkic, Gottfried Allmer, Bo Leng

TL;DR

This work introduces ESERCOM-D, a multi-agent AI framework that merges multimodal large language models with infrastructure-based perception to detect road hazards and generate ETSI-compliant DENM messages. The pipeline, consisting of Situation Detection, Distance Estimation, and Message Generation agents, uses ASN.1 UPER encoding for RSU transmission and is monitored via LangSmith. Evaluated on a curated 103-frame subset from the TAD dataset with Gemini-2.0-Flash and Gemini-2.5-Flash, the approach achieves perfect recall and DENM schema validity but shows limited accuracy in lane-related attributes and notable latency differences, with the older model often outperforming the newer one in structured ITS tasks. The results underscore the need for specialized fine-tuning of ITS-focused MLLMs and point toward real-world motorway demonstrations to validate deployment viability and safety benefits.

Abstract

Conventional road-situation detection methods achieve strong performance in predefined scenarios but fail in unseen cases and lack semantic interpretation, which is crucial for reliable traffic recommendations. This work introduces a multi-agent AI framework that combines multimodal large language models (MLLMs) with vision-based perception for road-situation monitoring. The framework processes camera feeds and coordinates dedicated agents for situation detection, distance estimation, decision-making, and Cooperative Intelligent Transport System (C-ITS) message generation. The framework is evaluated with both Gemini-2.0-Flash and Gemini-2.5-Flash on a custom dataset of 103 images extracted from 20 videos of the TAD dataset. The results show 100% recall in situation detection and perfect message schema correctness; however, both models produce false-positive detections and perform worse on three message attributes: number of lanes, driving lane status, and cause code. Surprisingly, Gemini-2.5-Flash, though more capable in general tasks, underperforms Gemini-2.0-Flash in detection accuracy and semantic understanding and incurs higher latency (Table II). These findings motivate further work on fine-tuning specialized LLMs or MLLMs tailored for intelligent transportation applications.
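
The abstract reports 100% recall alongside false-positive detections. The two are compatible because recall ignores false positives, while precision does not. A minimal Python sketch of both metrics follows; the counts in `example` are hypothetical and are not taken from the paper, and the `Counts` type is introduced here only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # frames with a road situation that were flagged
    fp: int  # frames without a road situation that were flagged anyway
    fn: int  # frames with a road situation that were missed


def recall(c: Counts) -> float:
    """Share of real situations that were detected: TP / (TP + FN)."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def precision(c: Counts) -> float:
    """Share of flagged frames that were real situations: TP / (TP + FP)."""
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


# Hypothetical counts for illustration only (not results from the paper).
example = Counts(tp=8, fp=2, fn=0)

if __name__ == "__main__":
    print(f"recall={recall(example):.2f} precision={precision(example):.2f}")
```

With no missed situations (`fn=0`), recall is perfect regardless of how many false alarms occur; only precision drops as `fp` grows.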

Paper Structure

This paper contains 8 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Exemplary scenarios: in-lane offset recommendation (cyan trajectory) and lane change recommendation (red trajectory) [tong2024connectgpt].
  • Figure 2: Pipeline of the proposed architecture: images from infrastructure cameras are analyzed by the Situation Detection Agent, distance is estimated by Depth Pro, and the Message Generation Agent produces standardized DENMs for transmission via Road Side Unit (RSU) devices.
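
The data flow described in the Figure 2 caption can be sketched roughly as below. This is an illustrative outline under stated assumptions, not the authors' implementation: the names `Situation`, `DenmDraft`, `run_pipeline`, and the stub agents are hypothetical, the agents are passed in as plain callables instead of MLLM or Depth Pro calls, and the assumption that no message is produced when no situation is detected is ours. ASN.1 UPER encoding and RSU transmission are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Situation:
    hazard_present: bool
    description: str


@dataclass
class DenmDraft:
    # Hypothetical, simplified stand-in for a DENM payload before encoding.
    description: str
    distance_m: float


def run_pipeline(
    image: bytes,
    detect: Callable[[bytes], Situation],
    estimate_distance: Callable[[bytes], float],
) -> Optional[DenmDraft]:
    """Chain the stages: detect, then estimate distance, then draft a message."""
    situation = detect(image)
    if not situation.hazard_present:
        # Assumption: nothing is generated when no situation is detected.
        return None
    distance_m = estimate_distance(image)
    return DenmDraft(description=situation.description, distance_m=distance_m)


# Stub agents for demonstration; real agents would call an MLLM and a depth model.
def fake_detect(image: bytes) -> Situation:
    return Situation(hazard_present=bool(image), description="stationary vehicle")


def fake_distance(image: bytes) -> float:
    return 42.0


if __name__ == "__main__":
    print(run_pipeline(b"frame", fake_detect, fake_distance))
```

Passing the agents as callables keeps the orchestration testable with stubs and makes it straightforward to swap one model for another, as the comparison of two Gemini versions requires.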