Towards Reproducibility in Predictive Process Mining: SPICE - A Deep Learning Library
Oliver Stritzel, Nick Hühnerbein, Simon Rauch, Itzel Zarate, Lukas Fleischmann, Moike Buck, Attila Lischka, Christian Frey
TL;DR
This work tackles reproducibility challenges in Predictive Process Mining by proposing SPICE, an open-source PyTorch framework that reimplements three prominent deep-learning baselines (Tax2017, Camargo2019, ProcessTransformer) for tasks including Next Activity, Next Timestamp, Suffix, and Remaining Time. SPICE standardizes data splitting, preprocessing, and evaluation to enable fair, cross-dataset comparisons and supports multi-task and autoregressive predictions with configurable samplers and experiment tracking. The authors critique common experimental design flaws in PPM literature, demonstrate reimplementation details, and provide a centralized platform for ablation studies and future benchmarking. They show results across 11 datasets, highlighting reproducibility gains and remaining challenges in achieving faithful metric reproduction. The work aims to move PPM research toward trustworthy baselines and practical applicability through rigorous tooling and reproducible benchmarks.
Abstract
In recent years, Predictive Process Mining (PPM) techniques based on artificial neural networks have evolved as a method for monitoring the future behavior of unfolding business processes and predicting Key Performance Indicators (KPIs). However, many PPM approaches lack reproducibility, transparency in decision making, and usability for incorporating novel datasets and benchmarking, which makes comparisons among different implementations very difficult. In this paper, we propose SPICE, a Python framework that reimplements three popular deep-learning-based baseline methods for PPM in PyTorch and provides a common, rigorously configurable base framework that enables reproducible and robust comparison of past and future modelling approaches. We compare SPICE against the originally reported metrics and against metrics obtained under a fair evaluation setup on 11 datasets.
