Dynamic and Scalable Data Preparation for Object-Centric Process Mining

Lien Bosmans, Jari Peeperkorn, Alexandre Goossens, Giovanni Lugaresi, Johannes De Smedt, Jochen De Weerdt

TL;DR

This paper proposes a database format for an intermediate data storage hub that separates process mining applications from their data sources in a hub-and-spoke architecture, and introduces a novel relational schema tailored to the requirements of robust object-centric event log storage.

Abstract

Object-centric process mining is emerging as a promising paradigm across diverse industries, drawing substantial academic attention. To support its data requirements, existing object-centric data formats primarily facilitate the exchange of static event logs between data owners, researchers, and analysts, rather than serving as a robust foundational data model for continuous data ingestion and transformation pipelines for subsequent storage and analysis. This focus results in suboptimal design choices in terms of flexibility, scalability, and maintainability. For example, current object-centric event log formats have difficulty handling novel object types or new attributes when data is streamed. This paper proposes a database format designed for an intermediate data storage hub, which separates process mining applications from their data sources using a hub-and-spoke architecture. It delineates essential requirements for robust object-centric event log storage from a data engineering perspective and introduces a novel relational schema tailored to these requirements. To validate the efficacy of the proposed database format, an end-to-end solution is implemented using a lightweight, open-source data stack. Our implementation includes data extractors for various object-centric event log formats, automated data quality assessments, and intuitive process data visualization capabilities.

Paper Structure (28 sections, 7 figures)

This paper contains 28 sections and 7 figures.

Figures (7)

  • Figure 1: Example of point-to-point (left) vs. hub-and-spoke (right) architecture. In a point-to-point architecture, every application (rectangle) is directly connected to the required data sources (oval). A hub-and-spoke architecture introduces an abstraction layer (hub) that separates data sources from their use cases. Only one connection (spoke) needs to be updated when changes are made to a data source or application.
  • Figure 2: Meta-model of the proposed storage hub.
  • Figure 3: Relational schema for the proposed object-centric event data storage hub, consisting of event tables (top), object tables (bottom) and relation tables (gray).
  • Figure 4: Example of interactive process data visualizations in Neo4j.
  • Figure B.5: Current prototype of the process data visualization.
  • ...and 2 more figures