Human-in-the-Loop Reasoning for Traffic Sign Detection: Collaborative Approach YOLO with Video-LLaVA
Mehdi Azarafza, Fatima Idrees, Ali Ehteshami Bejnordi, Charles Steinmetz, Stefan Henkler, Achim Rettberg
TL;DR
This work addresses the reliability of traffic sign recognition (TSR) for autonomous driving under adverse weather by adding a human-in-the-loop (HITL) and Video-LLaVA to guide YOLOv8 detections. The method iteratively passes YOLO outputs to a large vision-language model (LVLM), Video-LLaVA, and uses human prompts to refine the region of interest (ROI) and the reasoning; where necessary, the human-guided prompts steer the LVLM toward the correct speed limit. The contributions are a HITL pipeline combining YOLO with Video-LLaVA, an analysis of two challenging scenarios, and empirical evidence of accuracy gains (70% vs 50%/55%), which demonstrates the value of interactive multimodal reasoning for TSR. The approach shows practical potential for improving TSR in semi-real-world conditions, enabling more robust autonomous driving perception, and suggests avenues for extension to broader traffic sign recognition tasks.
Abstract
Traffic Sign Recognition (TSR) is a crucial component of autonomous vehicles. While You Only Look Once (YOLO) is a popular real-time object detection algorithm, factors such as training data quality and adverse weather conditions (e.g., heavy rain) can lead to detection failures. These failures are particularly dangerous when objects are visually similar, for example when a 30 km/h sign is mistaken for a higher speed limit sign. This paper proposes a method that combines video analysis and reasoning with human-in-the-loop guided prompting of a large vision model to improve YOLO's accuracy in detecting road speed limit signs, especially in semi-real-world conditions. It is hypothesized that the guided prompting and reasoning abilities of Video-LLaVA can enhance YOLO's traffic sign detection capabilities. This hypothesis is supported by an evaluation based on human-annotated accuracy metrics on a dataset of videos recorded in the CARLA car simulator. The results demonstrate that a collaborative approach combining YOLO with Video-LLaVA and reasoning can effectively address challenging situations, such as heavy rain and overcast conditions, that hinder YOLO's detection capabilities.
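The collaborative flow summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Detection` type, the `resolve_speed_limit` function, the confidence threshold of 0.6, the prompt wording, and the rule that a confident detection is accepted without consulting the LVLM are all assumptions made for illustration. The LVLM is represented by a plain callable, so no real YOLO or Video-LLaVA API is invoked.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Detection:
    """Hypothetical container for one detector output."""
    label: str         # e.g. "speed_limit_30"
    confidence: float  # detector score in [0, 1]


def resolve_speed_limit(
    detection: Optional[Detection],
    ask_lvlm: Callable[[str], Optional[str]],
    human_prompts: List[str],
    threshold: float = 0.6,  # assumed cut-off, not taken from the paper
) -> Optional[str]:
    """Escalate from detector, to LVLM, to human-guided LVLM prompts."""
    # Step 1 (assumption): accept a sufficiently confident detection.
    if detection is not None and detection.confidence >= threshold:
        return detection.label

    # Step 2: pass the detector output to the LVLM in a generic prompt.
    hint = detection.label if detection is not None else "none"
    answer = ask_lvlm(
        f"The detector suggests '{hint}'. "
        "What speed limit sign is visible in the video?"
    )
    if answer is not None:
        return answer

    # Step 3: try each human-written prompt in turn until the LVLM answers.
    for prompt in human_prompts:
        answer = ask_lvlm(prompt)
        if answer is not None:
            return answer

    # Nothing resolved the sign; the caller decides how to handle this.
    return None


def fake_lvlm(prompt: str) -> Optional[str]:
    """Stand-in LVLM that only answers when told where to look."""
    return "speed_limit_30" if "right side" in prompt else None
```

With the stand-in `fake_lvlm`, a low-confidence detection is only resolved once a human prompt such as "Look at the sign on the right side of the road." narrows the region of interest, which mirrors the role the human guide plays in the described approach.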
