Attention Please: What Transformer Models Really Learn for Process Prediction

Martin Käppel, Lars Ackermann, Stefan Jablonski, Simon Härtl

TL;DR

This paper investigates whether transformer attention scores can explain next-activity predictions in predictive business process monitoring (PBPM). It introduces two global, attention-based explanation approaches—Backward Explainer and Attention Exploration Explainer—and evaluates them across eight real-world event logs with multiple metrics, showing that attention scores reliably reflect decision factors and can be used to build meaningful, global explanations. The findings indicate strong explanation quality in terms of correctness and continuity, though limitations arise from the model’s reliance on the activity attribute and dataset quality. The work provides a foundation for trust and iterative model improvement in transformer-based PBPM and points to future directions in more nuanced masking, thresholding, and automatic process model discovery from prediction models.

Abstract

Predictive process monitoring aims to support the execution of a process at runtime with various predictions about the further evolution of a process instance. In recent years, a plethora of deep learning architectures have been established as state of the art for different prediction targets, among them the transformer architecture. The transformer architecture is equipped with a powerful attention mechanism that assigns an attention score to each part of the input, allowing the model to prioritize the most relevant information and leading to more accurate and contextual output. However, deep learning models largely represent a black box, i.e., their reasoning or decision-making process cannot be understood in detail. This paper examines whether the attention scores of a transformer-based next-activity prediction model can serve as an explanation for its decision-making. We find that attention scores in next-activity prediction models can serve as explainers and exploit this fact in two proposed graph-based explanation approaches. The insights gained could inspire future work on improving predictive business process models and on enabling neural-network-based mining of process models from event logs.

Paper Structure (20 sections, 5 equations, 5 figures, 2 tables, 1 algorithm)

This paper contains 20 sections, 5 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Attention scores of two different heads for a prefix of the BPIC12 event log.
  • Figure 2: JSD vs. TVD plots. Rectangles = models with frozen weights, triangles = seeded base models. JSD can only take values between 0 and $\log(2) \approx 0.693$.
  • Figure 3: Left: Variation in predictions (TVD) between masking elements in the prefix and masking only the attention score matrix. Right: Masking variants used.
  • Figure 4: Determining aggregated attention scores for each event in the prefix, assuming that $\mathcal{M}$ possesses two heads.
  • Figure 5: Example for BackwardExplainer. Relevant activities and likely next activities are highlighted in bold red in the input prefix and prediction vector.
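
The captions of Figures 2 and 3 refer to two distance measures between probability distributions: Jensen-Shannon divergence (JSD) and total variation distance (TVD). The following is a minimal sketch of both measures, not the authors' implementation. The function names and the example distributions `p` and `q` are hypothetical, and this page does not state which distributions the paper compares. The sketch uses the natural logarithm, which is what gives the upper bound of log(2) mentioned in the Figure 2 caption.

```python
import math


def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl(p, q):
    """KL divergence KL(p || q) in nats.

    Terms with p_i = 0 contribute 0. Assumes q_i > 0 wherever p_i > 0,
    which always holds when q is the mixture used in jsd() below.
    """
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jsd(p, q):
    """Jensen-Shannon divergence with natural log, bounded by log(2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical example: two distributions with disjoint support,
# which is the case where both measures reach their maximum.
p = [1.0, 0.0, 0.0]
q = [0.0, 1.0, 0.0]

print(tvd(p, q))  # 1.0
print(jsd(p, q))  # equals log(2), roughly 0.693
```

Identical distributions yield 0 for both measures, so the two axes of a JSD vs. TVD plot cover the ranges [0, log(2)] and [0, 1] respectively.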

Theorems & Definitions (1)

  • Definition: Prefix