Evaluating Vision-Language Models for Zero-Shot Detection, Classification, and Association of Motorcycles, Passengers, and Helmets

Lucas Choi; Ross Greer

TL;DR

The paper investigates a vision-language foundation model, OWLv2, for zero-shot detection of motorcycles, riders, helmet status, and seating position in video data to support road-safety enforcement. It proposes a cascaded detection pipeline that combines OWLv2 for object and attribute detection with an AlexNet-based seat classifier, addressing dataset biases and incomplete annotations. Helmet detection reaches an average precision of $AP_{helmet}=0.5324$, while motorcycle and person detection lag with $AP_{motorcycle}=0.4122$ and $AP_{person}=0.3561$. Detections are evaluated with an IoU criterion of $IoU \ge 0.5$, and the sensitivity of helmet detection to the detection threshold is examined. The results illustrate the potential of zero-shot, language-grounded perception for real-world safety systems and point to future work on broader pretraining, model fusion, and improved localization to enable robust infrastructure-to-vehicle (I2V) safety communications.
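As an illustration of the $IoU \ge 0.5$ criterion mentioned above, the following is a minimal sketch of how a predicted box might be compared with a ground-truth box. The `(x1, y1, x2, y2)` box format and the function names `iou` and `is_true_positive` are assumptions made for this example, not details taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are assumed to be (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    """
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes yield no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a match when IoU meets the threshold."""
    return iou(pred_box, gt_box) >= threshold
```

Under this criterion, a prediction that covers exactly half of a ground-truth box and nothing else sits right at the boundary and still counts as a match.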

Abstract

Motorcycle accidents pose significant risks, particularly when riders and passengers do not wear helmets. This study evaluates the efficacy of an advanced vision-language foundation model, OWLv2, in detecting and classifying the helmet-wearing status of motorcycle occupants in video data. We extend the dataset provided by the CVPR AI City Challenge and employ a cascaded approach to detection and classification that integrates OWLv2 and CNN models. The results highlight the potential of zero-shot learning to address challenges arising from incomplete and biased training datasets, demonstrating the use of such models to detect motorcycles, helmet usage, and occupant positions under varied conditions. We achieve an average precision of 0.5324 for helmet detection and provide precision-recall curves detailing detection and classification performance. Despite limitations such as low-resolution data and poor visibility, our research shows promising advances for automated vehicle safety and traffic safety enforcement systems.
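To make the reported metric concrete, here is a minimal sketch of how a precision-recall curve and an average precision value can be computed from scored detections. It assumes each detection has already been marked as a true or false positive (for example by an IoU test), and it uses the uninterpolated area under the curve; the paper may use a different AP variant. The function names and the toy inputs are hypothetical.

```python
def precision_recall_curve(scores, is_tp, num_gt):
    """Precision and recall after each detection, in descending score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = 0
    fp = 0
    precisions = []
    recalls = []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt if num_gt else 0.0)
    return precisions, recalls


def average_precision(scores, is_tp, num_gt):
    """Uninterpolated AP: sum of precision times the increase in recall."""
    precisions, recalls = precision_recall_curve(scores, is_tp, num_gt)
    ap = 0.0
    prev_recall = 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap


# Toy example: four detections, two ground-truth objects.
example_ap = average_precision(
    scores=[0.9, 0.8, 0.7, 0.6],
    is_tp=[True, False, True, False],
    num_gt=2,
)
```

Only true positives raise recall, so false positives lower precision without contributing area, which is why a steep precision drop at higher recall pulls AP down.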

Paper Structure (8 sections, 7 figures, 4 tables)

This paper contains 8 sections, 7 figures, 4 tables.

Introduction
Related Research
Algorithms for Image Processing with Vision-Language Detection
Experimental Method and Evaluation
Data
Results
Sensitivity of Helmet Detection to OWLViT Detection Threshold
Concluding Remarks and Future Research

Figures (7)

Figure 1: Example instances of classes to detect, cropped from the AI City Challenge dataset. From left to right: Motorcycle, Driver with Helmet, Driver with No Helmet, Child Passenger with No Helmet, Passenger 1 with Helmet, Passenger 1 with No Helmet, Passenger 2 with No Helmet.
Figure 2: Our algorithm for detecting the objects relevant to helmet safety, along with their attributes, operates as a cascade. First, we detect all motorcycles in the original image. Then, within each motorcycle, we detect all human occupants (drivers and passengers). Finally, for each detected human, we perform helmet detection and seat position classification. All detections, including the helmet detection used for classification, are done using OWLv2, while seat position classification is done using AlexNet.
Figure 3: Sample images from the dataset, captured from different angles and in different environments. From top to bottom: night, foggy, crowded.
Figure 4: Precision-Recall Curve of Motorcycle Detection. Initially, a slight increase in precision indicates improved confidence in early predictions. However, precision declines steeply as recall rises, highlighting the model’s challenge in maintaining accuracy while capturing more true positives.
Figure 5: Precision-Recall Curve of Passenger Detection. The curve demonstrates a high precision at low recall values. Despite the trade-off of precision and recall, the shape suggests a robust model performance in balancing the two.
...and 2 more figures
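
The cascade described in the Figure 2 caption can be sketched in outline as follows. This is an illustrative skeleton only: `detect`, `crop`, and `classify_seat` are hypothetical callables standing in for the OWLv2 detector, an image-cropping helper, and the AlexNet seat classifier, and the text prompts and the 0.5 score threshold are assumptions rather than values from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Occupant:
    box: tuple        # box in the coordinates of the motorcycle crop
    has_helmet: bool
    seat: str


@dataclass
class Motorcycle:
    box: tuple        # box in the coordinates of the full image
    occupants: list = field(default_factory=list)


def run_cascade(image, detect, crop, classify_seat, threshold=0.5):
    """Motorcycles first, then occupants per motorcycle, then helmet and seat.

    detect(image, prompt) is assumed to return a list of (box, score) pairs.
    crop(image, box) is assumed to return the sub-image inside box.
    classify_seat(image) is assumed to return a seat label string.
    """
    results = []
    for m_box, m_score in detect(image, "motorcycle"):
        if m_score < threshold:
            continue
        m_crop = crop(image, m_box)
        moto = Motorcycle(box=m_box)
        for p_box, p_score in detect(m_crop, "person"):
            if p_score < threshold:
                continue
            p_crop = crop(m_crop, p_box)
            # A helmet detection above threshold marks the occupant as helmeted.
            helmet_scores = [s for _, s in detect(p_crop, "helmet") if s >= threshold]
            moto.occupants.append(
                Occupant(
                    box=p_box,
                    has_helmet=bool(helmet_scores),
                    seat=classify_seat(p_crop),
                )
            )
        results.append(moto)
    return results
```

Because each stage runs only inside the crop produced by the previous one, occupants are associated with their motorcycle by construction, at the cost that a missed motorcycle also hides its riders.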
