Evaluating the Process Modeling Abilities of Large Language Models -- Preliminary Foundations and Results

Peter Fettke; Constantin Houy

TL;DR

The paper addresses the challenge of rigorously evaluating the process modeling abilities of large language models (LLMs), given that model quality is a multi-criteria concept and LLM outputs are non-deterministic. It proposes a standard evaluation scenario that pairs plain-English domain descriptions with BPMN ground-truth models to enable objective comparisons, while acknowledging the limitations of this scenario and ways to relax it. A key contribution is the explicit treatment of cost and time alongside quality, using Pareto-front analyses to characterize the trade-offs between these dimensions. The paper presents preliminary results and an illustrative example showing how token usage costs influence the evaluation, and it discusses risks such as data leakage and limited generalizability. It also outlines future directions, including further modeling languages, multimodal inputs, and interactive, domain-specific evaluation pipelines for assessing LLM-assisted process modeling.

Abstract

Large language models (LLMs) have revolutionized the processing of natural language. Although first benchmarks of the process modeling abilities of LLMs are promising, it is currently under debate to what extent an LLM can generate good process models. In this contribution, we argue that evaluating the process modeling abilities of LLMs is far from trivial. Hence, available evaluation results must be interpreted with care. For example, even in a simple scenario, not only the quality of a model should be taken into account, but also the cost and time needed to generate it. Thus, an LLM does not generate one optimal solution, but a set of Pareto-optimal variants. Moreover, several further challenges have to be taken into account, e.g. the conceptualization of quality, the validation of results, generalizability, and data leakage. We discuss these challenges in detail and outline future experiments to tackle them scientifically.
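
The notion of a set of Pareto-optimal variants can be illustrated with a minimal sketch. It assumes that each generated model variant is summarized by three numbers: a quality score (higher is better), a generation cost, and a generation time (both lower is better). The `Variant` type, the function names, and all sample values are hypothetical and are not taken from the paper's experiments.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """One generated process model variant (hypothetical summary record)."""
    name: str
    quality: float  # assumed score in [0, 1], higher is better
    cost: float     # assumed generation cost, lower is better
    seconds: float  # assumed generation time, lower is better


def dominates(a: Variant, b: Variant) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    no_worse = (a.quality >= b.quality and a.cost <= b.cost
                and a.seconds <= b.seconds)
    strictly_better = (a.quality > b.quality or a.cost < b.cost
                       or a.seconds < b.seconds)
    return no_worse and strictly_better


def pareto_front(variants: List[Variant]) -> List[Variant]:
    """Keep every variant that no other variant dominates."""
    return [v for v in variants
            if not any(dominates(other, v) for other in variants)]


# Illustrative values only; they do not come from the paper.
variants = [
    Variant("A", quality=0.90, cost=0.12, seconds=40.0),
    Variant("B", quality=0.80, cost=0.02, seconds=8.0),
    Variant("C", quality=0.75, cost=0.05, seconds=12.0),  # dominated by B
    Variant("D", quality=0.90, cost=0.15, seconds=45.0),  # dominated by A
]
front_names = [v.name for v in pareto_front(variants)]
print(front_names)
```

With these sample values, neither A nor B dominates the other: A has higher quality, while B is cheaper and faster. Both therefore remain on the front, which is why a single "best" model cannot be reported without fixing a trade-off between the three dimensions.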
