Human-in-the-loop Reasoning for Traffic Sign Detection: Collaborative Approach YOLO with Video-LLaVA

Mehdi Azarafza; Fatima Idrees; Ali Ehteshami Bejnordi; Charles Steinmetz; Stefan Henkler; Achim Rettberg

TL;DR

This work addresses the reliability of traffic sign recognition (TSR) for autonomous driving under adverse weather by placing a human in the loop with Video-LLaVA to guide YOLO v8 detections. The method passes YOLO outputs to a large vision-language model (Video-LLaVA) and uses human prompts to refine the region of interest (ROI) and the model's reasoning; where necessary, further human-guided prompts steer the model toward the correct speed limit. The contributions are a human-in-the-loop (HITL) pipeline combining YOLO with Video-LLaVA, an analysis of two challenging scenarios, and empirical evidence of accuracy gains (70% vs 50%/55%). The results suggest that interactive multimodal reasoning can improve TSR in semi-real-world conditions, make autonomous driving perception more robust, and extend to broader traffic-sign recognition tasks.

Abstract

Traffic Sign Recognition (TSR) is a crucial component of autonomous vehicles. While You Only Look Once (YOLO) is a popular real-time object detection algorithm, factors such as training data quality and adverse weather conditions (e.g., heavy rain) can lead to detection failures. These failures are particularly dangerous when objects are visually similar, for example when a 30 km/h sign is mistaken for a higher speed limit sign. This paper proposes a method that combines video analysis and reasoning with human-in-the-loop guided prompting of a large vision model to improve YOLO's accuracy in detecting road speed limit signs, especially in semi-real-world conditions. The hypothesis is that the guided prompting and reasoning abilities of Video-LLaVA can enhance YOLO's traffic sign detection capabilities. This hypothesis is supported by an evaluation based on human-annotated accuracy metrics on a dataset of videos recorded in the CARLA car simulator. The results demonstrate that a collaborative approach combining YOLO with Video-LLaVA and reasoning can effectively address challenging situations, such as heavy rain and overcast conditions, that hinder YOLO's detection capabilities.
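The collaborative loop summarized above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: `Detection`, `hitl_refine`, `ask_lvlm`, `accept`, and `fake_lvlm` are hypothetical names, the vision-language model is replaced by a stub, and the human reviewer's judgment is modeled as a simple callback.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Detection:
    """Hypothetical container for one detector output."""
    label: str         # e.g. "speed_limit_60"
    confidence: float  # detector score in [0, 1]


def hitl_refine(
    detection: Optional[Detection],
    ask_lvlm: Callable[[str], str],
    human_prompts: List[str],
    accept: Callable[[str], bool],
) -> Optional[str]:
    """Ask the model each human-written prompt in turn until one answer is accepted.

    `ask_lvlm` stands in for a call to a vision-language model and `accept`
    stands in for the human reviewer; both are assumptions of this sketch.
    """
    if detection is None:
        context = "no sign detected"
    else:
        context = f"detector says {detection.label}"
    for prompt in human_prompts:
        answer = ask_lvlm(f"{prompt} (context: {context})")
        if accept(answer):
            return answer
        # Carry the rejected answer forward so the next prompt can build on it.
        context = f"previous answer rejected: {answer}"
    return None  # no prompt produced an accepted answer


def fake_lvlm(prompt: str) -> str:
    """Stub model: only answers correctly once the prompt names a region."""
    return "30" if "top-right" in prompt else "60"


prompts = [
    "What is the speed limit on the sign?",
    "Focus on the top-right region. What is the speed limit on the sign?",
]
result = hitl_refine(Detection("speed_limit_60", 0.8), fake_lvlm, prompts,
                     accept=lambda answer: answer == "30")
```

In this toy run the first, generic prompt reproduces the detector's mistake, and the second prompt, which narrows the region of interest, yields the accepted answer. A real system would replace `fake_lvlm` with an actual model call and `accept` with the human's review.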

Paper Structure (10 sections, 1 equation, 6 figures)

Introduction
Related work
Large Vision-Language Model
Object detection reasoning with Video-LLaVA
Analysis of scenarios
Scenario 1
Scenario 2
Evaluation
Conclusion
Acknowledgment

Figures (6)

Figure 1: High-level overview of reasoning with Video-LLaVA
Figure 2: Video-LLaVA structure (Lin et al., 2023)
Figure 3: Activity Diagram for the Collaborative Approach of Human-in-the-Loop Reasoning Using YOLO and Video-LLaVA
Figure 4: a) YOLO Output: Road Speed Limit Sign 30 Detected as 60 (Blue Font 'Sign 60'). b) Step-by-Step Human-in-the-Loop with Video-LLaVA: Refining Detection with Regioning
Figure 5: a) YOLO Output: Initial Failure to Detect Speed Limit Due to Heavy Rain. b) Step-by-Step Human-in-the-Loop with Video-LLaVA: Refining Detection with Regioning to Guide Detection
...and 1 more figure
