Evaluating Cascaded Methods of Vision-Language Models for Zero-Shot Detection and Association of Hardhats for Increased Construction Safety

Lucas Choi, Ross Greer

TL;DR

This paper evaluates the use of vision-language models (VLMs) for zero-shot detection and association of hardhats to enhance construction safety, highlighting the strengths and weaknesses of current foundation models in safety perception domains.

Abstract

This paper evaluates the use of vision-language models (VLMs) for zero-shot detection and association of hardhats to enhance construction safety. Given the significant risk of head injuries in construction, proper enforcement of hardhat use is critical. We investigate the applicability of foundation models, specifically OWLv2, for detecting hardhats in real-world construction site images. Our contributions include the creation of a new benchmark dataset, Hardhat Safety Detection Dataset, by filtering and combining existing datasets and the development of a cascaded detection approach. Experimental results on 5,210 images demonstrate that the OWLv2 model achieves an average precision of 0.6493 for hardhat detection. We further analyze the limitations and potential improvements for real-world applications, highlighting the strengths and weaknesses of current foundation models in safety perception domains.
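For readers unfamiliar with the metric, the following is a minimal sketch of how an average precision score such as the one quoted above can be computed from ranked detections. It assumes the non-interpolated definition of AP and that each detection has already been matched against ground truth (for example by an IoU criterion). The abstract does not say which AP variant or matching rule the authors used, and the function name `average_precision` is hypothetical.

```python
from typing import List, Tuple


def average_precision(
    detections: List[Tuple[float, bool]], num_ground_truth: int
) -> float:
    """Non-interpolated average precision (an assumed variant, not the paper's).

    detections: (confidence, is_true_positive) pairs, where matching against
        ground-truth boxes is assumed to have been done beforehand.
    num_ground_truth: total number of ground-truth objects of the class.
    """
    if num_ground_truth == 0:
        return 0.0

    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        if is_true_positive:
            true_positives += 1
            # Precision measured at the point where recall increases.
            precision_sum += true_positives / rank

    # Unmatched ground-truth objects contribute zero precision.
    return precision_sum / num_ground_truth
```

Interpolated variants (such as the 11-point or all-point schemes used by common detection benchmarks) would give somewhat different values on the same detections.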

Paper Structure

This paper contains 9 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Example images with ground truth bounding boxes from the Hard Hat Workers Dataset demonstrating the absence of person class annotations
  • Figure 2: Example instances of classes cropped from the SHEL5k dataset. From left to right: Helmet, Head with Helmet, Person with Helmet, Head (Head without Helmet), Person without Helmet, and Face. Note that not every object in the SHEL5k dataset receives every class annotation that applies to it.
  • Figure 3: Diagram of Cascaded Object and Attribute Detection. From the original image, we detect all instances of persons. Within each person, we detect a head and then detect a helmet within the head. If a helmet detection is made, we classify the head as helmet-wearing. All detections, including helmet detection for the purpose of classification, are performed using OWLv2.
  • Figure 4: Diagram of Nested Object and Attribute Detection. From the original image, we detect all instances of persons. Within each person, we detect a helmet. All detections are performed using OWLv2.
  • Figure 5: Precision-Recall Curve of Person Detection. At higher confidence thresholds, OWLv2 makes too few relevant detections. As the threshold is lowered, the curve decreases slowly, suggesting high performance in person detection.
  • ...and 3 more figures
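
The cascaded procedure described in the Figure 3 caption (person, then head within the person, then helmet within the head) could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Detector` callable stands in for an open-vocabulary detector such as OWLv2, and the text queries, the 0.5 score threshold, and the choice to keep only the highest-scoring head per person are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

# Hypothetical detector interface: given an image region and a text query,
# return (box, score) pairs with boxes relative to that region's origin.
Detector = Callable[[Box, str], List[Tuple[Box, float]]]


@dataclass
class HeadResult:
    person_box: Box      # absolute image coordinates
    head_box: Box        # absolute image coordinates
    wears_helmet: bool


def to_absolute(parent: Box, child: Box) -> Box:
    """Shift a box expressed relative to `parent` into image coordinates."""
    px, py = parent[0], parent[1]
    return (child[0] + px, child[1] + py, child[2] + px, child[3] + py)


def cascaded_helmet_detection(
    image_box: Box, detect: Detector, threshold: float = 0.5
) -> List[HeadResult]:
    results = []
    # Stage 1: find persons in the full image.
    for person_rel, person_score in detect(image_box, "person"):
        if person_score < threshold:
            continue
        person = to_absolute(image_box, person_rel)

        # Stage 2: find a head inside the person crop (assumption: keep the
        # single highest-scoring head above the threshold).
        heads = [(b, s) for b, s in detect(person, "head") if s >= threshold]
        if not heads:
            continue
        head_rel, _ = max(heads, key=lambda d: d[1])
        head = to_absolute(person, head_rel)

        # Stage 3: any confident helmet inside the head crop marks the head
        # as helmet-wearing.
        has_helmet = any(s >= threshold for _, s in detect(head, "helmet"))
        results.append(HeadResult(person, head, has_helmet))
    return results
```

The nested variant in Figure 4 would drop stage 2 and query for a helmet directly inside each person crop. In practice, `detect` would crop the image to the given region and run the detector on that crop.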