An Experimental Comparison of Alternative Techniques for Event-Log Augmentation
Alessandro Padella, Francesco Vinci, Massimiliano de Leoni
TL;DR
Process mining relies on event logs, yet data scarcity hampers machine- and deep-learning approaches. The authors perform an extensive empirical comparison of seven event-log augmentation techniques against a baseline based on a stochastic transition system, across eight logs, evaluating similarity, preservation of predictive information, entropy-driven diversity, and computation time. They find that the baseline and RIMS excel in different facets: the baseline in control-flow, resources, and speed, and RIMS in time- and congestion-related accuracy. CVAE boosts control-flow similarity at the cost of computation time, and SMOTE generally underperforms because it ignores process constraints. The study suggests that combining fast, accurate control-flow generation with detailed resource-time modeling could yield high-fidelity synthetic logs with strong utility for predictive process monitoring.
Abstract
Process mining analyzes and improves processes by examining transactional data stored in event logs, which record sequences of events with timestamps. However, the effectiveness of process mining, especially when combined with machine or deep learning, depends on having large event logs. Event-log augmentation addresses this limitation by generating additional traces that simulate realistic process executions while considering various perspectives such as time, control-flow, workflow, resources, and domain-specific attributes. Although prior research has explored event-log augmentation techniques, there has been no comprehensive comparison of their effectiveness. This paper reports on an evaluation of seven state-of-the-art augmentation techniques across eight event logs. The results are also compared with those obtained by a baseline technique based on a stochastic transition system. The comparison was carried out by analyzing four aspects: similarity, preservation of predictive information, information loss/enhancement, and the computation time required. Results show that, across these criteria, a technique based on a stochastic transition system combined with resource queue modeling would provide higher-quality synthetic event logs. Event-log augmentation techniques are also compared with traditional data-augmentation techniques, showing that the former provide significant benefits, whereas the latter fail to consider process constraints.
