EdgeMiner: Distributed Process Mining at the Data Sources

Julia Andersen, Patrick Rathje, Christian Imenkamp, Agnes Koschmider, Olaf Landsiedel

TL;DR

This paper presents EdgeMiner, an algorithm for distributed process mining operating directly on sensor nodes on a stream of real-time event data, which determines predecessors for each event efficiently, reducing the communication overhead by up to 96% compared to querying all nodes.

Abstract

Process mining is moving beyond mining traditional event logs and nowadays includes, for example, data sourced from sensors in the Internet of Things (IoT). The volume and velocity of data generated by such sensors make it increasingly challenging to process the data efficiently with traditional process discovery algorithms, which operate on a centralized event log. This paper presents EdgeMiner, an algorithm for distributed process mining operating directly on sensor nodes on a stream of real-time event data. In contrast to centralized algorithms, EdgeMiner tracks each event and its predecessor and successor events directly on the sensor node where the event is sensed and recorded. As EdgeMiner aggregates direct successions on the individual nodes, the raw data does not need to be stored centrally, thus improving both scalability and privacy. We analytically and experimentally show the correctness of EdgeMiner. In addition, our evaluation results show that EdgeMiner determines predecessors for each event efficiently, reducing the communication overhead by up to 96% compared to querying all nodes. Further, we show that the number of queried nodes stabilizes after relatively few events, and that batching predecessor queries in groups reduces the average number of queried nodes per event to less than 2.5%.

Paper Structure (37 sections, 2 equations, 9 figures, 2 tables, 1 algorithm)

This paper contains 37 sections, 2 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: While traditional process mining (left) collects all events at a central entity, EdgeMiner (right) processes them directly at the source and only exchanges aggregates (partial footprint matrices), increasing scalability and privacy.
  • Figure 2: Phase 1 -- Event Ordering and Partial Footprint Matrix Construction in EdgeMiner: Without a central entity, nodes determine the order of events collaboratively using message passing. In this example, event 1 is a start event: after the node detects it, the node queries the other nodes for the predecessor event, receives no positive response, and thus records that it detected a start event. Upon sensing event 2, the detecting node queries for the predecessor event and receives a response listing event 1 as the predecessor. It stores this information in its local footprint matrix (FM).
  • Figure 3: Phase 2 -- Requesting a Footprint Matrix: We request partial FMs and start/end activity flags from all nodes. Upon receiving the data, we concatenate the matrices, form start and end activity sets, and compute the footprint matrix.
  • Figure 4: Average number of nodes queried with and without MFP Requesting, including standard error. MFP Requesting and knowledge of the start events reduce communication demands by a factor of 7.5 to 30 depending on the dataset.
  • Figure 5: Fitness over time. Intermediate FMs quickly converge to the centrally computed FM. The BPIC 2017 dataset, for example, already reaches a fitness of over 90% after 200 events.
  • ...and 4 more figures
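
The two phases described in the captions of Figures 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, its fields, and the function `request_footprint` are hypothetical names; each node is assumed to sense exactly one activity; message passing is replaced by direct method calls; and end activities are approximated as activities whose latest event in some case has not yet been claimed as a predecessor. The sketch queries every node for each event, so it corresponds to the "querying all nodes" baseline and does not include the MFP Requesting optimization mentioned in Figure 4. The relation symbols follow the usual footprint-matrix notation from the alpha algorithm.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical sensor node; assumed to sense exactly one activity."""
    activity: str
    last_seen: dict = field(default_factory=dict)         # case id -> timestamp of latest local event
    open_cases: set = field(default_factory=set)          # cases whose latest local event has no successor yet
    partial_fm: Counter = field(default_factory=Counter)  # (predecessor, own activity) -> count
    is_start: bool = False

    def answer_query(self, case, before):
        """Return the timestamp of the latest local event of `case` before `before`, else None."""
        ts = self.last_seen.get(case)
        return ts if ts is not None and ts < before else None

    def sense(self, case, ts, nodes):
        """Phase 1: query all nodes for the predecessor and update the local partial FM."""
        replies = [(n.answer_query(case, ts), n) for n in nodes]
        replies = [(t, n) for t, n in replies if t is not None]
        if replies:
            # The most recent earlier event of the same case is the direct predecessor.
            _, pred = max(replies, key=lambda r: r[0])
            pred.open_cases.discard(case)
            self.partial_fm[(pred.activity, self.activity)] += 1
        else:
            # No positive response: this node detected a start event.
            self.is_start = True
        self.last_seen[case] = ts
        self.open_cases.add(case)


def request_footprint(nodes):
    """Phase 2: merge partial FMs and start/end flags into one footprint matrix."""
    succession = Counter()
    for n in nodes:
        succession.update(n.partial_fm)
    starts = {n.activity for n in nodes if n.is_start}
    ends = {n.activity for n in nodes if n.open_cases}
    acts = sorted(n.activity for n in nodes)
    fm = {}
    for a in acts:
        for b in acts:
            ab, ba = (a, b) in succession, (b, a) in succession
            fm[(a, b)] = "||" if ab and ba else "->" if ab else "<-" if ba else "#"
    return fm, starts, ends


# Usage: two cases, with traces <a, b, c> and <a, c>, interleaved in time.
nodes = {name: Node(name) for name in "abc"}
events = [(1, "a", 1), (1, "b", 2), (2, "a", 3), (1, "c", 4), (2, "c", 5)]
for case, activity, ts in events:
    nodes[activity].sense(case, ts, list(nodes.values()))

fm, starts, ends = request_footprint(list(nodes.values()))
print(fm[("a", "b")], fm[("b", "a")], starts, ends)
```

Only the aggregated succession counts and the start/end flags leave a node in this sketch; the raw events stay local, which mirrors the scalability and privacy argument made in the abstract.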

Theorems & Definitions (3)

  • Definition 1: Direct Succession
  • Definition 2: Causality, No Direct Succession
  • Claim 1