Beyond Rule-based Named Entity Recognition and Relation Extraction for Process Model Generation from Natural Language Text

Julian Neuberger, Lars Ackermann, Stefan Jablonski

TL;DR

The paper addresses the challenge of automatically generating formal business process descriptions from natural language text. It advances a data-driven extraction pipeline and extends the PET dataset with linguistic reference identities. The pipeline combines CRF-based named entity recognition (NER), entity resolution (ER) driven by neural coreference resolution, and gradient-boosted relation extraction (RE), called BoostRelEx, to extract process elements and relations. It outperforms a rule-based baseline on several relation types, though PET's small size constrains the use of deep learning methods. Key contributions include extending PET with element identity information, establishing neural ER as a strong baseline over naive matching, and introducing BoostRelEx together with an explicit analysis of error propagation and relation-distance effects. The work demonstrates that data-driven BPM information extraction is feasible and adaptable to new domains, while highlighting the need for joint models and in-domain or LLM-based enhancements to overcome dataset limitations and improve downstream integration.

Abstract

Process-aware information systems offer extensive advantages to companies, facilitating planning, operations, and optimization of day-to-day business activities. However, the time-consuming but required step of designing formal business process models often hampers the potential of these systems. To overcome this challenge, automated generation of business process models from natural language text has emerged as a promising approach to expedite this step. Generally, two crucial subtasks have to be solved: extracting process-relevant information from natural language and creating the actual model. Approaches to the first subtask are rule-based methods, which are highly optimized for specific domains but hard to adapt to related applications. To solve this issue, we present an extension to an existing pipeline that makes it entirely data-driven. We demonstrate the competitiveness of our improved pipeline, which not only eliminates the substantial overhead associated with feature engineering and rule definition, but also enables adaptation to different datasets, entity and relation types, and new domains. Additionally, the largest available dataset (PET) for the first subtask contains no information about linguistic references between mentions of entities in the process description. Yet, the resolution of these mentions into a single visual element is essential for high-quality process models. We propose an extension to the PET dataset that incorporates information about linguistic references, and a corresponding method for resolving them. Finally, we provide a detailed analysis of the inherent challenges in the dataset at hand.

Paper Structure (13 sections, 8 figures, 1 table)

This paper contains 13 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Example of the differences between the information extraction phase with and without resolving process element identities. Resolving process element identities from their mentions (right) allows generation of a correct data flow; without this resolution (left), the data flow is disjointed.
  • Figure 2: Outline of our proposed extended extraction pipeline.
  • Figure 3: Example of our ER method, which is based on a pretrained end-to-end neural coreference resolver. The predicted coreferent text spans "a claim" and "it" are accepted and resolved to an entity containing the mentions "claim" and "it", since both text spans overlap at least 50% with the mentions' texts.
  • Figure 4: Values of metrics $P$, $R$, and $F_{1}$ for different negative sampling rates $r_n$.
  • Figure 5: The relation heatmap shows the number of relations aggregated by argument types, denoted $\textit{head}\rightarrow\textit{tail}$. Only combinations where at least one relation exists are shown. The type-token ratio plot shows the mean type-token ratio for mention clusters with at least two mentions.
  • ...and 3 more figures
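
The acceptance rule described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that spans are half-open token intervals and that "overlap" means the share of a mention's tokens covered by the predicted span; the caption does not state the exact definition. The function names and the example sentence are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token interval [start, end)


def overlap_ratio(predicted: Span, mention: Span) -> float:
    """Share of the mention's tokens covered by the predicted span.

    Assumed definition: the source only states "overlap at least 50%".
    """
    start = max(predicted[0], mention[0])
    end = min(predicted[1], mention[1])
    mention_len = mention[1] - mention[0]
    if mention_len <= 0:
        return 0.0
    return max(0, end - start) / mention_len


def resolve_cluster(
    cluster: List[Span], mentions: List[Span], threshold: float = 0.5
) -> List[Span]:
    """Map one predicted coreference cluster onto known mentions.

    A predicted span is accepted if its best-matching mention reaches the
    overlap threshold; accepted mentions together form one entity.
    """
    entity: List[Span] = []
    for predicted in cluster:
        best = max(
            mentions, key=lambda m: overlap_ratio(predicted, m), default=None
        )
        if best is None:
            continue
        if overlap_ratio(predicted, best) >= threshold and best not in entity:
            entity.append(best)
    return entity


# Hypothetical sentence: "The clerk checks a claim and forwards it"
# token indices:           0   1     2      3 4     5   6        7
mentions = [(0, 2), (4, 5), (7, 8)]   # "The clerk", "claim", "it"
cluster = [(3, 5), (7, 8)]            # predicted spans "a claim", "it"
entity = resolve_cluster(cluster, mentions)
```

In this example both predicted spans fully cover their mentions, so "claim" and "it" end up in the same entity. A predicted span covering only one token of a four-token mention would fall below the threshold and be discarded.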