Applied Machine Learning to Anomaly Detection in Enterprise Purchase Processes

A. Herreros-Martínez, R. Magdalena-Benedicto, J. Vila-Francés, A. J. Serrano-López, S. Pérez-Díaz

TL;DR

The paper tackles anomaly detection in enterprise purchase processes with unlabeled data by combining univariate (z-score, DBSCAN) and multivariate (k-Means with categorical encodings, Isolation Forest) techniques. It proposes an ensemble prioritisation to rank anomalous transactions and integrates explainability methods (SHAP, LIME) to aid auditors. Results show that the univariate methods yield manageable candidate sets, while k-Means clustering often exhibits weak structure (as measured by SSE and Silhouette) under the tested configurations, with Isolation Forest providing complementary signals. The approach is implemented in KNIME as a reproducible workflow, offering practical value for automated auditing and pointing to future work on richer encodings and additional clustering methods for stronger anomaly characterisation.

Abstract

In a context of continuous digitalisation of processes, organisations must deal with the challenge of detecting anomalies that can reveal suspicious activities within an ever-increasing volume of data. To pursue this goal, audit engagements are carried out regularly, and internal auditors and purchase specialists are constantly looking for new methods to automate these processes. This work proposes a methodology to prioritise the investigation of the cases detected in two large purchase datasets drawn from real data. The goal is to contribute to the effectiveness of the companies' control efforts and to improve performance in carrying out such tasks. A comprehensive Exploratory Data Analysis is carried out before applying unsupervised Machine Learning techniques aimed at detecting anomalies. A univariate approach is applied through the z-Score index and the DBSCAN algorithm, while a multivariate analysis is implemented with the k-Means and Isolation Forest algorithms together with the Silhouette index, so that each method yields a proposal of candidate transactions to be reviewed. An ensemble prioritisation of the candidates is provided, together with a proposal of explainability methods (LIME, Shapley, SHAP) to help the company specialists interpret the results.
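
As a rough illustration of the univariate step and the ensemble prioritisation described above, the following Python sketch flags purchase amounts by z-score and then ranks transactions by how many detectors flagged them. The threshold of 3, the vote-counting rule, the sample data and all names (`zscore_flags`, `prioritise`, `detector_b`, `detector_c`) are assumptions made for illustration; the paper implements its workflow in KNIME, and its actual prioritisation scheme may differ.

```python
from collections import Counter
from statistics import mean, pstdev


def zscore_flags(amounts, threshold=3.0):
    """Return the set of indices whose |z-score| exceeds the threshold."""
    mu = mean(amounts)
    sigma = pstdev(amounts)
    if sigma == 0:
        return set()  # all values identical: nothing stands out
    return {i for i, x in enumerate(amounts) if abs((x - mu) / sigma) > threshold}


def prioritise(flag_sets):
    """Rank transaction indices by the number of detectors that flagged them."""
    votes = Counter()
    for flags in flag_sets.values():
        votes.update(flags)
    # Most-flagged first; ties broken by index for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical purchase amounts: 20 ordinary orders and one extreme one.
amounts = [100.0 + i for i in range(20)] + [10_000.0]
flags = {
    "zscore": zscore_flags(amounts),
    # Placeholder outputs standing in for other detectors
    # (for example DBSCAN or Isolation Forest); not real results.
    "detector_b": {20, 3},
    "detector_c": {20},
}
ranking = prioritise(flags)
```

Here `ranking` lists each flagged transaction index with its vote count, so an auditor would review the transactions flagged by the most methods first.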

Paper Structure (20 sections, 5 figures, 4 tables)

This paper contains 20 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of the Elbow Curve by encoding strategy (frequency, mean, median and mode). Company 1
  • Figure 2: Comparison of Silhouette coefficient by encoding strategy (frequency, mean, median and mode). Company 1
  • Figure 3: Comparison of the Elbow Curves by encoding strategy (frequency, mean, median and mode) with univariate outliers segregated. Company 1
  • Figure 4: Comparison of the overall Silhouette coefficient (10% sample) by encoding strategy (frequency, mean, median and mode) with univariate outliers segregated. Company 1
  • Figure 5: Example of SHAP values: (left) force plot for individual explainability; (right) bar plot for feature explainability. Company 2
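
The elbow and Silhouette comparisons listed in Figures 1 to 4 can be outlined as follows. This is a minimal sketch, assuming scikit-learn and NumPy rather than the KNIME workflow the paper uses. The synthetic data, the `frequency_encode` and `elbow_and_silhouette` helpers, the range of k, and the reading of "frequency encoding" as replacing each category by its relative frequency are all assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def frequency_encode(categories):
    """Replace each category label by its relative frequency in the column."""
    _, inverse, counts = np.unique(categories, return_inverse=True, return_counts=True)
    return (counts / counts.sum())[inverse]


def elbow_and_silhouette(X, k_values, sample_size=None, seed=0):
    """Return {k: (SSE, silhouette)} for k-Means fitted at each k."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        # sample_size lets the Silhouette be estimated on a subsample.
        sil = silhouette_score(X, model.labels_, sample_size=sample_size, random_state=seed)
        results[k] = (model.inertia_, sil)
    return results


# Synthetic stand-in for purchase data: an amount and a supplier category.
rng = np.random.default_rng(0)
n = 300
amount = rng.lognormal(mean=5.0, sigma=1.0, size=n)
supplier = rng.choice(["S1", "S2", "S3", "S4"], size=n, p=[0.5, 0.3, 0.15, 0.05])

# Scale both features so neither dominates the Euclidean distance.
X = StandardScaler().fit_transform(
    np.column_stack([np.log(amount), frequency_encode(supplier)])
)
scores = elbow_and_silhouette(X, k_values=range(2, 7), sample_size=n // 10)
```

Plotting the SSE values against k gives an elbow curve, and plotting the sampled Silhouette values against k gives the counterpart of the Silhouette comparison; repeating the run with a different encoding of `supplier` would allow the strategies to be compared side by side.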