Leveraging Data Augmentation for Process Information Extraction

Julian Neuberger; Leonie Doll; Benedict Engelmann; Lars Ackermann; Stefan Jablonski

TL;DR

The paper tackles data scarcity in business process information extraction from natural language descriptions. It evaluates NLP data augmentation methods, particularly those from the NL-Augmenter framework, on the PET dataset for Mention Detection (MD) and Relation Extraction (RE). Simple augmentation techniques yield meaningful $F_1$ gains (up to $+2.9$ percentage points for MD and $+4.5$ for RE), while back translation based on large language models offers little improvement at high computational cost. The authors analyze how augmentations alter text, highlighting three characteristics: linguistic variability, span-length variation, and relation-direction changes. They also use hyperparameter optimization to identify effective augmentation configurations, and make code and results publicly available.

Abstract

Business Process Modeling projects often require formal process models as a central component. The high cost of creating such formal process models has motivated many fields of research aimed at automatically generating process models from readily available data. These include process mining on event logs and generating business process models from natural language texts. Research in the latter field regularly faces limited data availability, which hinders both the evaluation and the development of new techniques, especially learning-based ones. To overcome this data scarcity, in this paper we investigate the application of data augmentation to natural language text data. Data augmentation methods are well established in machine learning for creating new, synthetic data without human assistance. We find that many of these methods are applicable to the task of business process information extraction and improve the accuracy of extraction. Our study shows that data augmentation is an important component in enabling machine learning methods for the task of business process model generation from natural language text, where rule-based systems are currently still mostly state of the art. Simple data augmentation techniques improved the $F_1$ score of mention extraction by $2.9$ percentage points, and the $F_1$ score of relation extraction by $4.5$ percentage points. To better understand how data augmentation alters human-annotated texts, we analyze the resulting text, visualizing and discussing the properties of augmented textual data. We make all code and experiment results publicly available.

Paper Structure (8 sections, 6 figures, 1 table)

This paper contains 8 sections, 6 figures, 1 table.

Introduction
Background
Related Work
Experiment Setup
Data Augmentation Effects
Finding Optimal Configurations
Results
Conclusion and Future Work

Figures (6)

Figure 1: Example of adding noise to an image (left) and to a sentence (right). The image keeps its semantics, while the sentence loses them.
Figure 2: Examples of four different data augmentation techniques. Random deletion, random swap, and random insertion (all written in red) do not preserve the semantics of a sample and its label. Rephrasing (green) is an example of a technique that does.
Figure 3: Example for a fragment of a natural language business process description and its corresponding business process model fragment in BPMN.
Figure 4: Choosing optimal configurations for data augmentation techniques.
Figure 5: Two examples for the effects of data augmentation techniques.
...and 1 more figures
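
The Figure 2 caption names random deletion and random swap as token-level augmentation techniques. The following is a minimal Python sketch of what such techniques might look like on a tokenized sentence. It is an illustration only, not the authors' implementation or the NL-Augmenter API: the function names, the parameters `p` and `n_swaps`, and the example sentence are all assumptions introduced here. The sketch also ignores mention and relation annotations, which a real pipeline for this task would have to keep aligned with the modified tokens.

```python
import random
from typing import List


def random_deletion(tokens: List[str], p: float, rng: random.Random) -> List[str]:
    """Drop each token independently with probability p (hypothetical helper)."""
    if not tokens:
        return []
    kept = [tok for tok in tokens if rng.random() >= p]
    # Never return an empty sentence: fall back to one randomly chosen token.
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens: List[str], n_swaps: int, rng: random.Random) -> List[str]:
    """Swap the tokens at two distinct random positions, n_swaps times (hypothetical helper)."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    # Made-up example sentence, not taken from the PET dataset.
    sentence = "the clerk checks the claim and sends it to the manager".split()
    print(random_deletion(sentence, p=0.2, rng=rng))
    print(random_swap(sentence, n_swaps=1, rng=rng))
```

Both helpers take an explicit `random.Random` instance so that augmented samples can be reproduced from a seed. As the caption notes, edits of this kind do not preserve the semantics of a sample, so their deletion rate and swap count would need to be tuned rather than fixed.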
