Leveraging Generative AI for Extracting Process Models from Multimodal Documents
Marvin Voelter, Raheleh Hadian, Timotheus Kampik, Marius Breitmayer, Manfred Reichert
TL;DR
The paper addresses the automatic generation of graphical BPMN process models from multimodal documents (text and images). It builds a small multimodal dataset, defines a ground-truth-based evaluation framework, and evaluates GPT-4V with zero-, one-, and few-shot prompts. One-shot prompting achieves the highest average similarity (~0.87) between generated and ground-truth models, which suggests that semi-automated process modeling with off-the-shelf multimodal LLMs is feasible. The contributions are a 123-model multimodal dataset with ground-truth BPMN JSON, an element-based and semantic evaluation metric built on a Dice variant, and open-source code for reproducible benchmarking. Together, these give a structured basis for future systematic evaluations, benchmark extensions, and improvements in multimodal BPMN extraction.
Abstract
This paper presents an investigation of the capabilities of Generative Pre-trained Transformers (GPTs) to auto-generate graphical process models from multi-modal (i.e., text- and image-based) inputs. More precisely, we first introduce a small dataset as well as a set of evaluation metrics that allow for a ground truth-based evaluation of multi-modal process model generation capabilities. We then conduct an initial evaluation of commercial GPT capabilities using zero-, one-, and few-shot prompting strategies. Our results indicate that GPTs can be useful tools for semi-automated process modeling based on multi-modal inputs. More importantly, the dataset and evaluation metrics as well as the open-source evaluation code provide a structured framework for continued systematic evaluations moving forward.
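To give a rough sense of how a Dice-style similarity between a generated and a ground-truth model can be computed, the following is a minimal sketch of the plain Sørensen–Dice coefficient over sets of element labels. It is not the paper's metric, which is a Dice variant with element-based and semantic matching. The function name `dice_similarity` and the example task labels are hypothetical and chosen only for illustration.

```python
def dice_similarity(generated, ground_truth):
    """Plain Sørensen–Dice coefficient between two collections of labels.

    Illustrative assumption: elements match only on exact label equality.
    The paper's variant also accounts for semantic similarity.
    """
    a, b = set(generated), set(ground_truth)
    if not a and not b:
        # Two empty models are treated as identical (a convention chosen here).
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical task labels for a ground-truth and a generated process model.
truth = {"Receive order", "Check stock", "Ship goods", "Send invoice"}
generated = {"Receive order", "Check stock", "Ship goods", "Notify customer"}

# Three of four labels overlap: 2 * 3 / (4 + 4) = 0.75
score = dice_similarity(generated, truth)
print(score)
```

Exact label matching is the simplest choice and penalizes paraphrased labels (for example "Dispatch goods" versus "Ship goods"), which is one reason a semantic comparison is useful for generated models.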
