Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse Workplaces

Zhiling Chen; Hanning Chen; Mohsen Imani; Ruimin Chen; Farhad Imani

TL;DR

Clip2Safety tackles PPE compliance in diverse workplaces by combining scene-aware vision-language prompting with open-vocabulary detection and LLM-guided reasoning. Its four modules (scene recognition, visual prompt, safety items detection, and fine-grained verification) enable interpretable, attribute-level PPE verification across varied environments. Case studies across six real-world scenes show higher accuracy than state-of-the-art question-answering based VLM baselines and inference about two hundred times faster, with further gains when using LLMs for decision fusion. The approach reduces manual labeling needs and improves robustness to scene variation, which benefits real-time safety inspections and regulatory compliance.

Abstract

Workplace accidents due to personal protective equipment (PPE) non-compliance raise serious safety concerns and lead to legal liabilities, financial penalties, and reputational damage. While object detection models have shown the capability to address this issue by identifying safety items, most existing models, such as YOLO, Faster R-CNN, and SSD, are limited in verifying the fine-grained attributes of PPE across diverse workplace scenarios. Vision language models (VLMs) are gaining traction for detection tasks by leveraging the synergy between visual and textual information, offering a promising solution to traditional object detection limitations in PPE recognition. Nonetheless, VLMs face challenges in consistently verifying PPE attributes due to the complexity and variability of workplace environments, requiring them to interpret context-specific language and visual cues simultaneously. We introduce Clip2Safety, an interpretable detection framework for diverse workplace safety compliance, which comprises four main modules: scene recognition, the visual prompt, safety items detection, and fine-grained verification. The scene recognition identifies the current scenario to determine the necessary safety gear. The visual prompt formulates the specific visual prompts needed for the detection process. The safety items detection identifies whether the required safety gear is being worn according to the specified scenario. Lastly, the fine-grained verification assesses whether the worn safety equipment meets the fine-grained attribute requirements. We conduct real-world case studies across six different scenarios. The results show that Clip2Safety not only demonstrates an accuracy improvement over state-of-the-art question-answering based VLMs but also achieves inference times two hundred times faster.
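The four-module flow described in the abstract can be sketched as a small orchestration function. This is a minimal illustration only, not the authors' implementation: the names `check_compliance`, `ComplianceReport`, and the four callables are hypothetical, and the toy lambdas stand in for the real scene classifier, LLM prompt generator, detector, and attribute verifier.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ComplianceReport:
    """Interpretable result: which items are missing, which attributes failed."""
    scene: str
    missing_items: List[str] = field(default_factory=list)
    failed_attributes: Dict[str, List[str]] = field(default_factory=dict)

    @property
    def compliant(self) -> bool:
        return not self.missing_items and not self.failed_attributes


def check_compliance(
    image: object,
    recognize_scene: Callable[[object], str],                  # module 1 (hypothetical)
    build_prompts: Callable[[str], Dict[str, List[str]]],      # module 2 (hypothetical)
    detect_item: Callable[[object, str], bool],                # module 3 (hypothetical)
    verify_attribute: Callable[[object, str, str], bool],      # module 4 (hypothetical)
) -> ComplianceReport:
    scene = recognize_scene(image)
    # Maps each required safety item to its fine-grained attribute descriptions.
    prompts = build_prompts(scene)
    report = ComplianceReport(scene=scene)
    for item, attributes in prompts.items():
        if not detect_item(image, item):
            report.missing_items.append(item)
            continue  # attributes cannot be verified on an item that is not worn
        failed = [a for a in attributes if not verify_attribute(image, item, a)]
        if failed:
            report.failed_attributes[item] = failed
    return report


# Toy stand-ins so the sketch runs without any model; all values are made up.
toy_image = {"scene": "construction site", "worn": {"helmet": {"hard shell"}}}
report = check_compliance(
    toy_image,
    recognize_scene=lambda img: img["scene"],
    build_prompts=lambda scene: {"helmet": ["hard shell"], "vest": ["reflective strips"]},
    detect_item=lambda img, item: item in img["worn"],
    verify_attribute=lambda img, item, attr: attr in img["worn"][item],
)
print(report)
```

Keeping the missing items and failed attributes separate in the report mirrors the two-stage idea in the abstract: presence is checked first, and attribute checks run only for items that were detected.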

Paper Structure (20 sections, 10 equations, 7 figures, 6 tables)

Introduction
Research Background
Personal Protective Equipment Detection
Vision Language Model
Research Methodology
Scene Recognition Module
Visual Prompt Module
Safety Items Detection Module
Fine-grained Verification Module
Evaluation
Experiment
Dataset
Implementation Details
Compare to state-of-art methods
Ablation Study
...and 5 more sections

Figures (7)

Figure 1: (a) Inadequately defined specifications for safety attire across various scenes. (b) Detailed criteria for fine-grained attributes of different safety attire vary across scenes. (c) Individuals with no safety items only take up a tiny portion of the total samples. (d) Poor image-text embedding when directly using VLMs on original images.
Figure 2: Model Architecture: Step 1: Detect if the person is wearing the required safety items using scene recognition and object detection, paired with VLMs for verification. Step 2: Verify that the detected safety items meet specific attribute requirements by comparing image patches with generated text prompts using a feature-image matching module.
Figure 3: Clip2Safety Visual Prompt Module: Beginning with scene recognition to identify the environment, user prompts are then issued to a large language model to retrieve the necessary safety items and their specific visual features relevant to the recognized scene.
Figure 4: Example images for 6 scenes. (a) Construction site images from Pictor-v3. (b) Chemical Factory images from Safety Detection dataset. (c) Seafood Factory images from PPEs Dataset. (d) Hospital images from CPPE-5 Dataset. (e) Baking Factory images from Safety Detection Dataset. (f) Mechanical Factory images from Safety Detection Dataset.
Figure 5: Example of Benchmarking LLaVA-1.6-7b for Our Safety Detection Task. Step 1 involves asking the model to identify the presence of required safety items by posing yes/no questions. Step 2 focuses on verifying specific attributes of the detected items by asking more detailed questions to ensure compliance with safety standards.
...and 2 more figures
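Step 2 in the Figure 2 caption compares image patches with generated text prompts. The page does not specify the scoring function, so the sketch below assumes cosine similarity between embedding vectors, a common choice for CLIP-style models. The function names and the two-dimensional toy embeddings are hypothetical; real embeddings would come from an image encoder and a text encoder.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_matching_prompt(
    patch_embedding: Sequence[float],
    prompt_embeddings: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return the text prompt whose embedding is closest to the image patch."""
    scores = {
        prompt: cosine_similarity(patch_embedding, embedding)
        for prompt, embedding in prompt_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up embeddings for illustration only.
patch = [1.0, 0.0]
prompts = {
    "a helmet with a chin strap": [0.9, 0.1],
    "a helmet without a chin strap": [0.1, 0.9],
}
best_prompt, score = best_matching_prompt(patch, prompts)
print(best_prompt, round(score, 3))
```

Matching a cropped patch against contrasting prompts, rather than the whole image against one prompt, is one way to keep the attribute decision focused on the detected item.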
