Are Open-Vocabulary Models Ready for Detection of MEP Elements on Construction Sites

Abdalwhab Abdalwhab, Ali Imran, Sina Heydarian, Ivanka Iordanova, David St-Onge

TL;DR

The paper addresses automatic detection of mechanical, electrical, and plumbing (MEP) components on construction sites, comparing open-vocabulary vision-language models with fine-tuned lightweight detectors. It introduces a real-world dataset collected with a Journeybot mobile robot and annotated with 10 MEP classes, then compares a fine-tuned YOLO11 Nano against Grounded-SAM, Grounding-DINO, and DETIC, counting a detection as correct at an IoU threshold of 0.5. The fine-tuned YOLO11 Nano substantially outperforms the open-vocabulary models in precision, recall, and F1, and is better suited to real-time use, despite the open-vocabulary models' broader generalization capabilities. The work highlights the current limitations of open-vocabulary models on domain-specific tasks and suggests directions such as targeted fine-tuning and task-oriented prompting to close the gap.
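
To make the evaluation criterion concrete, the following is a minimal sketch of how precision, recall, and F1 can be computed for one image at an IoU threshold of 0.5. It is not the authors' evaluation code: the greedy, confidence-ordered matching rule, the function names, and the class labels "pipe" and "duct" are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[str, float, Box]       # (class label, confidence, box)
GroundTruth = Tuple[str, Box]            # (class label, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(
    preds: List[Detection], gts: List[GroundTruth], iou_thr: float = 0.5
) -> Tuple[float, float, float]:
    """Greedy matching (an assumed rule): each prediction, taken in order of
    decreasing confidence, claims the unmatched same-class ground-truth box
    with the highest IoU, and counts as a true positive if IoU >= iou_thr."""
    matched = set()
    tp = 0
    for label, _, box in sorted(preds, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, (gt_label, gt_box) in enumerate(gts):
            if idx in matched or gt_label != label:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_thr:
            matched.add(best_idx)
            tp += 1
    fp = len(preds) - tp  # predictions that matched nothing
    fn = len(gts) - tp    # ground-truth boxes that were missed
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical labels and boxes, for illustration only.
gts = [("pipe", (0, 0, 10, 10)), ("duct", (20, 20, 30, 30))]
preds = [
    ("pipe", 0.9, (1, 1, 10, 10)),    # IoU 0.81 with the pipe box: true positive
    ("duct", 0.8, (50, 50, 60, 60)),  # no overlap: false positive
    ("pipe", 0.3, (0, 0, 10, 10)),    # pipe box already matched: false positive
]
precision, recall, f1 = precision_recall_f1(preds, gts)
print(precision, recall, f1)
```

In this toy case one of three predictions is correct and one of two ground-truth boxes is found. A full benchmark would typically accumulate the counts over all test images and classes before computing the ratios.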

Abstract

The construction industry has long explored robotics and computer vision, yet their deployment on construction sites remains very limited. These technologies have the potential to revolutionize traditional workflows by enhancing accuracy, efficiency, and safety in construction management. Ground robots equipped with advanced vision systems could automate tasks such as monitoring mechanical, electrical, and plumbing (MEP) systems. The present research evaluates the applicability of open-vocabulary vision-language models compared to fine-tuned, lightweight, closed-set object detectors for detecting MEP components using a mobile ground robotic platform. A dataset collected with cameras mounted on a ground robot was manually annotated and analyzed to compare model performance. The results demonstrate that, despite the versatility of vision-language models, fine-tuned lightweight models still largely outperform them in specialized environments and for domain-specific tasks.

Paper Structure

This paper contains 4 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Our Journeybot platform for data collection: a Jackal rover base from Clearpath equipped with several cameras and LiDARs (not used in this study).
  • Figure 2: YOLO11 normalized confusion matrix for its predictions on the testing dataset split.
  • Figure 3: Qualitative examples of YOLO11 Nano's performance, displaying predicted classes, bounding boxes, and confidence scores.