Structured Extraction from Business Process Diagrams Using Vision-Language Models

Pritam Deka, Barry Devereux

TL;DR

This work tackles extracting structured BPMN representations from diagram images without source XML files by leveraging Vision-Language Models (VLMs) guided by carefully designed prompts, with optional OCR-based enrichment. It introduces a BPMN-focused, zero-shot prompt scheme, a parallel OCR enrichment stage, and a parsing/validation pipeline to produce JSON outputs aligned to ground-truth BPMN XML. A newly constructed BPMN-VLM dataset with 202 diagram-XML pairs supports rigorous, prompt-based evaluation across multiple VLMs, OCR configurations, and strict vs. relaxed criteria. The study reveals that top-tier models excel on vision-only extraction, OCR benefits mid-tier models, and a DFS+BFS prompt variant often yields the strongest, most robust performance for complex BPMN structures. This work enables image-based process understanding and automated extraction for scenarios where source BPMN files are unavailable, with implications for process mining and diagram-to-XML translation.

Abstract

Business Process Model and Notation (BPMN) is a widely adopted standard for representing complex business workflows. While BPMN diagrams are often exchanged as visual images, existing methods primarily rely on XML representations for computational analysis. In this work, we present a pipeline that leverages Vision-Language Models (VLMs) to extract structured JSON representations of BPMN diagrams directly from images, without requiring source model files or textual annotations. We also incorporate optical character recognition (OCR) for textual enrichment and evaluate the generated element lists against ground truth data derived from the source XML files. Our approach enables robust component extraction in scenarios where original source files are unavailable. We benchmark multiple VLMs and observe performance improvements in several models when OCR is used for text enrichment. In addition, we conduct extensive statistical analyses of OCR-based enrichment methods and prompt ablation studies, providing a clearer understanding of their impact on model performance.
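
The evaluation step described in the abstract, comparing a generated element list against ground truth derived from the source XML, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the JSON shape (`type` and `name` keys), the set of element types collected (tasks, events, gateways only), and the definitions of "strict" (exact match) and "relaxed" (case- and whitespace-insensitive match) are hypothetical stand-ins for criteria the summary names but does not define.

```python
import json
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace of the BPMN 2.0 model schema.
BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"


def ground_truth_elements(bpmn_xml):
    """Collect (type, name) pairs for tasks, events and gateways in a BPMN XML string."""
    root = ET.fromstring(bpmn_xml)
    elements = []
    for node in root.iter():
        # Tags look like "{namespace}userTask"; skip anything outside the BPMN model namespace.
        if not node.tag.startswith("{" + BPMN_NS + "}"):
            continue
        local = node.tag.split("}", 1)[1]
        if local.lower().endswith(("task", "event", "gateway")):
            elements.append((local, node.get("name", "")))
    return elements


def normalize(name):
    """Lower-case and collapse whitespace (assumed 'relaxed' text comparison)."""
    return " ".join(name.lower().split())


def score(predicted, truth, relaxed=False):
    """Return (precision, recall, F1) over multisets of (type, name) pairs."""
    if relaxed:
        key = lambda e: (e[0].lower(), normalize(e[1]))
    else:
        key = lambda e: e
    pred = Counter(key(e) for e in predicted)
    gold = Counter(key(e) for e in truth)
    tp = sum((pred & gold).values())  # multiset intersection = matched elements
    precision = tp / sum(pred.values()) if pred else 0.0
    recall = tp / sum(gold.values()) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical ground-truth XML and a hypothetical model output.
EXAMPLE_XML = """<definitions xmlns="http://www.omg.org/spec/BPMN/20100524/MODEL">
  <process id="p1">
    <startEvent id="s" name="Order received"/>
    <userTask id="t" name="Check stock"/>
    <exclusiveGateway id="g" name="In stock?"/>
    <endEvent id="e" name="Done"/>
  </process>
</definitions>"""

EXAMPLE_PREDICTION = json.loads("""[
  {"type": "startEvent", "name": "Order received"},
  {"type": "userTask", "name": "check  stock"},
  {"type": "exclusiveGateway", "name": "In stock?"}
]""")

truth = ground_truth_elements(EXAMPLE_XML)
predicted = [(e["type"], e["name"]) for e in EXAMPLE_PREDICTION]

print("strict :", score(predicted, truth))
print("relaxed:", score(predicted, truth, relaxed=True))
```

In this toy case the mis-cased, double-spaced task label counts as a miss under the strict setting and as a hit under the relaxed one, which is the kind of gap that separate strict and relaxed scores are meant to expose.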

Paper Structure

This paper contains 28 sections, 10 equations, 3 figures, 10 tables, 1 algorithm.

Figures (3)

  • Figure 1: Cropped BPMN region with fine-grained cues
  • Figure 2: Average model ranks across evaluation settings based on the Friedman test. Lower ranks indicate better performance.
  • Figure 3: Effect sizes (Cohen's $d$) for OCR vs. VLM-only across models and OCR methods. Positive values indicate performance gains from OCR.
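
For readers unfamiliar with the effect size reported in Figure 3, the following is a minimal sketch of Cohen's $d$ between two sets of scores. It assumes the classic pooled-standard-deviation form for two independent samples; the summary does not state which variant (pooled or paired) the paper uses, and the scores below are made-up numbers, not results from the paper.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(with_ocr, vlm_only):
    """Cohen's d with a pooled standard deviation (assumed variant).

    Positive values mean the first sample (here: OCR-enriched scores) is higher.
    """
    n1, n2 = len(with_ocr), len(vlm_only)
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = ((n1 - 1) * variance(with_ocr) + (n2 - 1) * variance(vlm_only)) / (n1 + n2 - 2)
    return (mean(with_ocr) - mean(vlm_only)) / sqrt(pooled_var)


# Hypothetical per-diagram F1 scores for one model, with and without OCR enrichment.
scores_with_ocr = [0.70, 0.80, 0.90]
scores_vlm_only = [0.60, 0.70, 0.80]

print(round(cohens_d(scores_with_ocr, scores_vlm_only), 3))
```

Swapping the two arguments flips the sign, matching the figure's convention that positive values indicate a gain from OCR.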