Multi-Agent AI Framework for Road Situation Detection and C-ITS Message Generation
Kailin Tong, Selim Solmaz, Kenan Mujkic, Gottfried Allmer, Bo Leng
TL;DR
This work introduces ESERCOM-D, a multi-agent AI framework that combines multimodal large language models (MLLMs) with infrastructure-based perception to detect road hazards and generate ETSI-compliant Decentralized Environmental Notification Messages (DENMs). The pipeline consists of Situation Detection, Distance Estimation, and Message Generation agents, encodes messages with ASN.1 UPER for transmission by a roadside unit (RSU), and is monitored via LangSmith. Evaluated on a curated 103-frame subset of the TAD dataset with Gemini-2.0-Flash and Gemini-2.5-Flash, the approach achieves perfect recall and DENM schema validity, but it shows limited accuracy on lane-related attributes and notable latency differences between the two models, with the older model often outperforming the newer one on structured ITS tasks. The results underscore the need for specialized fine-tuning of ITS-focused MLLMs and point toward real-world motorway demonstrations to validate deployment viability and safety benefits.
Abstract
Conventional road-situation detection methods achieve strong performance in predefined scenarios but fail in unseen cases and lack semantic interpretation, which is crucial for reliable traffic recommendations. This work introduces a multi-agent AI framework that combines multimodal large language models (MLLMs) with vision-based perception for road-situation monitoring. The framework processes camera feeds and coordinates dedicated agents for situation detection, distance estimation, decision-making, and Cooperative Intelligent Transport System (C-ITS) message generation. The framework is evaluated with both Gemini-2.0-Flash and Gemini-2.5-Flash on a custom dataset of 103 images extracted from 20 videos of the TAD dataset. The results show 100% recall in situation detection and perfect message schema correctness; however, both models produce false-positive detections and show reduced accuracy on the number of lanes, the driving-lane status, and the cause code. Surprisingly, Gemini-2.5-Flash, though more capable in general tasks, underperforms Gemini-2.0-Flash in detection accuracy and semantic understanding and incurs higher latency (Table II). These findings motivate further work on fine-tuning specialized LLMs or MLLMs tailored for intelligent transportation applications.
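To make the distinction between recall and false positives concrete, the following minimal Python sketch computes both recall and precision from per-frame labels. It is an illustration only: the `FrameResult` type, the `detection_metrics` helper, and the toy labels are assumptions introduced here, and the counts are not taken from the evaluation reported in this work.

```python
from dataclasses import dataclass


@dataclass
class FrameResult:
    """Hypothetical per-frame record (not a structure from the paper)."""
    has_situation: bool  # ground truth: a road situation is present
    detected: bool       # model output: a road situation was reported


def detection_metrics(results):
    """Count TP/FP/FN over frames and derive recall and precision."""
    tp = sum(r.has_situation and r.detected for r in results)
    fp = sum((not r.has_situation) and r.detected for r in results)
    fn = sum(r.has_situation and (not r.detected) for r in results)
    # Guard against empty denominators so the sketch never divides by zero.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "recall": recall, "precision": precision}


# Toy labels, purely illustrative: three true situations all detected,
# one normal frame wrongly flagged, one normal frame correctly ignored.
toy = (
    [FrameResult(True, True)] * 3
    + [FrameResult(False, True)]
    + [FrameResult(False, False)]
)
metrics = detection_metrics(toy)
print(metrics)
```

In this toy case every true situation is found, so recall is 1.0, while the single spurious detection lowers precision to 0.75. This is the pattern the abstract describes: perfect recall can coexist with false-positive detections.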
