Visual Language Model as a Judge for Object Detection in Industrial Diagrams

Sanjukta Ghosh

TL;DR

The paper tackles automatic quality assessment of object-detection outputs for dense industrial diagrams (P&IDs) to support digitalization. It introduces a two-stage framework where Visual Language Models act as judges to assess completeness, precision, localization, and classification, using high-level semantic cues and targeted grounding strategies. Grounding methods include visual cues, textual anchors, and geometric coordinates to guide the VLM in identifying missing objects and in refining detections. Experimental results on the PID2GRAPH dataset show substantial mAP gains after refinement, with Gemini-based judging outperforming Gemma-based judging and text-tag plus coordinate grounding yielding the best results. The approach reduces human validation workload and enables scalable, agentic workflows for industrial diagram digitalization.
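The localization assessment and the mAP figures mentioned above both rest on intersection-over-union (IoU) matching between predicted and reference boxes. The following is a minimal sketch of that standard computation, not the paper's implementation: the box format `(x_min, y_min, x_max, y_max)`, the function names `iou` and `match_detections`, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format (not taken from the paper): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(
    preds: List[Box], refs: List[Box], threshold: float = 0.5
) -> Tuple[int, int, int]:
    """Greedy one-to-one matching of predictions to reference boxes.

    Returns (true_positives, false_positives, missed_references).
    The threshold is a hypothetical default, not a value from the paper.
    """
    unmatched = list(range(len(refs)))
    tp = fp = 0
    for p in preds:
        best_j, best_score = None, threshold
        for j in unmatched:
            score = iou(p, refs[j])
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is None:
            fp += 1  # no reference box overlaps enough: spurious detection
        else:
            tp += 1
            unmatched.remove(best_j)  # each reference is matched at most once
    return tp, fp, len(unmatched)
```

In these terms, false positives correspond to precision errors and unmatched references to completeness errors, the two failure types a judge would need to flag before refinement.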

Abstract

Industrial diagrams such as piping and instrumentation diagrams (P&IDs) are essential for the design, operation, and maintenance of industrial plants. Converting these diagrams into digital form is an important step toward building digital twins and enabling intelligent industrial automation. A central challenge in this digitalization process is accurate object detection. Although recent advances have significantly improved object detection algorithms, there remains a lack of methods to automatically evaluate the quality of their outputs. This paper addresses this gap by introducing a framework that employs Visual Language Models (VLMs) to assess object detection results and guide their refinement. The approach exploits the multimodal capabilities of VLMs to identify missing or inconsistent detections, thereby enabling automated quality assessment and improving overall detection performance on complex industrial diagrams.

Paper Structure

This paper contains 10 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Typical P&ID detection results.
  • Figure 2: Attention Visualization.
  • Figure 3: Proposed Framework for Object Detection Quality Assessment and Correction.