Towards a Benchmark for Causal Business Process Reasoning with LLMs
Fabiana Fournier, Lior Limonad, Inna Skarbovsky
TL;DR
This work tackles the problem of evaluating whether LLMs can reason about the causal and process aspects of business processes. It introduces BP^C, a formalism that combines temporal and causal relations, and builds a benchmark consisting of template situations, deductive rules, and Yes/No questions across process, causal, and combined perspectives, with an extensible design that can cover more domains. The methodology supports both testing and training of LLMs, and the authors provide open-source access to the dataset and prompts. Initial evaluations across multiple LLMs reveal that performance varies by perspective and domain, underscoring both the potential and the challenges of causally-aware BPM reasoning. The benchmark aims to standardize evaluation, accelerate progress in LLM-driven AI-augmented business process management systems (ABPMS), and scale through community collaboration and domain expansion, ultimately supporting more reliable process interventions and improvements.
Abstract
Large Language Models (LLMs) are increasingly used to boost organizational efficiency and automate tasks. Although they were not originally designed for complex cognitive processes, recent efforts have extended their use to activities such as reasoning, planning, and decision-making. In business processes, such abilities could be invaluable for leveraging the massive corpora on which LLMs have been trained to gain a deep understanding of such processes. In this work, we plant the seeds for the development of a benchmark to assess the ability of LLMs to reason about the causal and process perspectives of business operations. We refer to this view as Causally-augmented Business Processes (BP^C). The core of the benchmark comprises a set of BP^C-related situations, a set of questions about these situations, and a set of deductive rules used to systematically resolve the ground-truth answers to these questions. Using LLMs, this seed is then instantiated into a larger-scale set of domain-specific situations and questions. Reasoning on BP^C is of crucial importance for process interventions and process improvement. Our benchmark, accessible at https://huggingface.co/datasets/ibm/BPC, can be used in one of two modalities: testing the performance of any target LLM, and training an LLM to advance its capability to reason about BP^C.
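To make the idea of resolving ground-truth answers with deductive rules concrete, the following is a minimal sketch, not the benchmark's actual rule set. It assumes a single illustrative rule (transitivity) applied separately to a temporal "precedes" relation and a causal "causes" relation. The activity names, the `transitive_closure` and `answer` helpers, and the choice of rule are all hypothetical.

```python
from itertools import product


def transitive_closure(pairs):
    """Return the transitive closure of a set of (a, b) relation pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Snapshot the set so we can add to it while iterating.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def answer(relation, source, target):
    """Resolve a Yes/No question by checking the deduced relation."""
    return "Yes" if (source, target) in transitive_closure(relation) else "No"


# Hypothetical situation: three activities in a loan process.
# Temporal perspective: which activity precedes which.
precedes = {
    ("receive_application", "check_credit"),
    ("check_credit", "approve_loan"),
}
# Causal perspective: only one causal dependency is asserted.
causes = {("receive_application", "check_credit")}

# Process question: does receiving the application precede approval?
q_process = answer(precedes, "receive_application", "approve_loan")
# Causal question: does receiving the application cause approval?
q_causal = answer(causes, "receive_application", "approve_loan")

print(q_process, q_causal)
```

In this sketch the two questions receive different answers because temporal precedence can be deduced along the chain while no causal chain reaches the approval activity, which illustrates why the process and causal perspectives are evaluated separately.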
