Automated Hazard Detection in Construction Sites Using Large Language and Vision-Language Models

Islem Sahraoui

TL;DR

This work addresses automated hazard detection in construction by combining textual analysis of OSHA narratives with visual reasoning on site imagery in a prompt-driven multimodal framework. It deploys two pipelines: a textual pipeline that uses GPT-4o-mini to extract structured fields and classify incidents, and a visual pipeline that uses GPT-4o Vision to describe scenes, predict hazards, and localize risks. These are complemented by an open-source evaluation of Molmo 7B and Qwen2 VL 2B on ConstructionSite-10K. Results show 89% accuracy in text classification and competitive PPE-rule detection with the lightweight models when prompt ensembles are used, which suggests that cost-effective, scalable safety analytics are feasible without fine-tuning. The findings have practical implications for safety management, including potential integration with BIM and live monitoring, while acknowledging limitations in data scale and generalization and the need for broader validation and field testing.

Abstract

This thesis explores a multimodal AI framework for enhancing construction safety through the combined analysis of textual and visual data. In safety-critical environments such as construction sites, accident data often exists in multiple formats, such as written reports, inspection records, and site imagery, which makes it difficult to synthesize hazards using traditional approaches. To address this, the thesis proposes a framework that combines text and image analysis to help identify safety hazards on construction sites. Two case studies were conducted to evaluate the capabilities of large language models (LLMs) and vision-language models (VLMs) for automated hazard identification. The first case study introduces a hybrid pipeline that uses GPT-4o and GPT-4o-mini to extract structured insights from a dataset of 28,000 OSHA accident reports (2000-2025). The second case study extends this investigation using Molmo 7B and Qwen2 VL 2B, two lightweight, open-source VLMs. Using the public ConstructionSite10k dataset, the two models were evaluated on rule-level safety violation detection with natural language prompts. This experiment served as a cost-aware benchmark against proprietary models and allowed testing at scale with ground-truth labels. Despite their smaller size, Molmo 7B and Qwen2 VL 2B showed competitive performance in certain prompt configurations, reinforcing the feasibility of low-resource multimodal systems for rule-aware safety monitoring.
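The TL;DR refers to prompt ensembles for rule-level detection, but this summary does not say how the answers from different prompts are combined. The sketch below illustrates one plausible reading, a strict majority vote over several paraphrased yes/no prompts. The aggregation rule, the prompt wording, and the names `ensemble_vote`, `parse_yes_no`, `PROMPTS`, and `fake_model` are all assumptions made for illustration and are not taken from the thesis. `fake_model` is a stand-in for a real VLM call so that the example runs on its own.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical paraphrases of one PPE rule; the thesis prompts are not shown here.
PROMPTS = [
    "Is any worker in this image missing a hard hat? Answer yes or no.",
    "Do you see a person without a helmet? Answer yes or no.",
    "Is there a PPE violation involving a hard hat? Answer yes or no.",
]


def parse_yes_no(answer: str) -> bool:
    """Map a free-text answer to True (violation reported) or False."""
    return answer.strip().lower().startswith("yes")


def ensemble_vote(
    image_path: str,
    prompts: Sequence[str],
    ask_model: Callable[[str, str], str],
) -> bool:
    """Return True when a strict majority of prompts report a violation.

    Assumed aggregation rule: ties count as "no violation".
    """
    votes = [parse_yes_no(ask_model(image_path, p)) for p in prompts]
    counts = Counter(votes)
    return counts[True] > counts[False]


def fake_model(image_path: str, prompt: str) -> str:
    """Stand-in for a VLM call; answers 'Yes' only to prompts naming a hard hat."""
    return "Yes, one worker." if "hard hat" in prompt.lower() else "No."


# Two of the three prompts mention a hard hat, so the majority reports a violation.
result = ensemble_vote("site.jpg", PROMPTS, fake_model)
```

In practice `ask_model` would wrap whichever model is being evaluated (for example Molmo 7B or Qwen2 VL 2B), and the vote could be replaced by a weighted or threshold-based rule if some prompts prove more reliable than others.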

Paper Structure

This paper contains 24 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overall methodology showing three parallel pipelines: (a) Textual pipeline using GPT-4o-mini, (b) Visual pipeline using GPT-4o Vision, and (c) Open-source visual pipeline using Molmo-7B and Qwen2-VL-2B.
  • Figure 2: Pipeline prompt.
  • Figure 3: Scene description prompt.
  • Figure 4: Accident scenario prediction prompt.
  • Figure 5: High-risk hazard filtering prompt.
  • ...and 4 more figures