Explainable Unsupervised Anomaly Detection with Random Forest

Joshua S. Harvey, Joshua Rosaler, Mingshu Li, Dhruv Desai, Dhagash Mehta

TL;DR

The paper addresses unsupervised anomaly detection with minimal preprocessing and robust handling of missing data. It learns a distance measure by training a Random Forest ($RF_{uni}$) to discriminate real data from synthetic data sampled uniformly over the data bounds. It introduces GAP proximities as an interpretable distance metric, constructs a hyperparameter-free outlier score based on distances to a central core, and demonstrates superior anomaly detection performance across a large benchmark (ADBench) compared to traditional detectors. Additionally, the work provides locally explainable predictions by linking outlier scores to Random Forest partition structure and counterfactual trajectories, enabling insights into feature-level contributions. The approach yields practical benefits for scalable, preprocessing-light anomaly detection with built-in visualization potential and explainability, while noting limitations related to contamination and potential sensitivity to extreme values.

Abstract

We describe the use of an unsupervised Random Forest for similarity learning and improved unsupervised anomaly detection. By training a Random Forest to discriminate between real data and synthetic data sampled from a uniform distribution over the real data bounds, a distance measure is obtained that anisometrically transforms the data, expanding distances at the boundary of the data manifold. We show that using distances recovered from this transformation improves the accuracy of unsupervised anomaly detection, compared to other commonly used detectors, demonstrated over a large number of benchmark datasets. As well as improved performance, this method has advantages over other unsupervised anomaly detection methods, including minimal requirements for data preprocessing, native handling of missing data, and potential for visualizations. By relating outlier scores to partitions of the Random Forest, we develop a method for locally explainable anomaly predictions in terms of feature importance.
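
The training setup described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function name `rf_uni_outlier_scores` and its parameters are hypothetical. The sketch uses scikit-learn's `RandomForestClassifier`, and it substitutes two simpler quantities for those in the paper: a plain leaf co-occurrence proximity stands in for GAP proximities, and the mean distance to all other points stands in for the paper's outlier score based on distances to a central core.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rf_uni_outlier_scores(X, n_trees=200, seed=0):
    """Hypothetical sketch: real-vs-uniform Random Forest distances and scores."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    # Synthetic class: sample uniformly within the per-feature bounds of the real data.
    lo, hi = X.min(axis=0), X.max(axis=0)
    X_syn = rng.uniform(lo, hi, size=(n, d))

    # Train the forest to discriminate real (1) from synthetic (0) points.
    X_all = np.vstack([X, X_syn])
    y_all = np.r_[np.ones(n), np.zeros(n)]
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X_all, y_all)

    # Leaf index of every real point in every tree, shape (n, n_trees).
    leaves = forest.apply(X)

    # Assumption: simple co-occurrence proximity (fraction of trees in which two
    # points share a leaf), used here in place of the paper's GAP proximities.
    prox = np.zeros((n, n))
    for t in range(n_trees):
        col = leaves[:, t]
        prox += col[:, None] == col[None, :]
    prox /= n_trees
    dist = 1.0 - prox

    # Assumption: mean distance to all other points, used here in place of the
    # paper's score based on distances to a central core.
    scores = dist.sum(axis=1) / (n - 1)
    return scores, dist
```

Calling `rf_uni_outlier_scores(X)` on a numeric array returns one score per row together with the full distance matrix, which could then be passed to an embedding method such as MDS for visualization. The pairwise matrix is quadratic in the number of points, so this form is only suitable for small datasets.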

Paper Structure

This paper contains 16 sections, 17 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Embedding two-dimensional Gaussian data with unsupervised RF distances. A) i) Data simulated from a two-dimensional Gaussian distribution. Points with distances from the origin above the 90th percentile are colored red (outliers). ii) A matrix of Euclidean distances, sorted by each point's distance from the origin. B) Distance matrices for Random Forest GAP distances obtained from the (i) $RF_{uni}$ and (ii) ExtraTrees models. C) Spearman rank correlation, $\rho$, between distances in measurement space (Euclidean) and RF distances. D) Histograms of inter-point RF distances for inliers (blue) and outliers (red). E) Stress plot for multidimensional scaling (MDS) of RF distances. F) MDS embedding in three dimensions for (i) $RF_{uni}$ and (ii) ExtraTrees RF distances. G) MDS embedding in two dimensions for RF distances.
  • Figure 2: Anomaly detection performance on benchmark datasets. A) Direct comparison between anomaly detectors using distances from either the ExtraTrees or $RF_{uni}$ Random Forest model, aggregated across benchmark datasets. Comparisons of i) ranking amongst other unsupervised detectors, ii) AUCROC score, and iii) percent of AUCROC score achieved by best performing detector for each dataset. $p$-values indicate the result of a Wilcoxon signed-rank test. B) Results for all unsupervised anomaly detectors. i) Boxplots of ranked performance for each detector, sorted left to right by superior performance. ii) Pairwise comparisons computed with a Conover post-hoc test, with $p$-values adjusted for multiple comparisons via the Holm–Bonferroni method. iii) A critical difference diagram connecting detectors without significant differences at $\alpha=0.1$. Dashed colored lines indicate no critical difference between specific detectors, despite critical differences between their intermediately ranked detectors.
  • Figure 3: Anomaly detection performance with missing data. A) The performance of $RF_{uni}$ and other top-performing unsupervised anomaly detectors with missing data on the Campaign dataset, evaluated with AUCROC. All detectors were tested on data imputed with the mean (dashed lines), while $RF_{uni}$ was also applied directly on missing data (solid line). Shaded area indicates standard deviation over 5 repeats. B) Aggregate performance across all datasets for varying levels of missingness, following mean imputation. AUCROC scores for each dataset are normalized to that of the best-performing detector on complete data.
  • Figure 4: Explainability of $RF_{uni}$ anomaly detector outlier scores. A) A trajectory through the outlier gradient field on two-dimensional Gaussian data. Gradients are shown in red arrows. B) The same trajectory visualized in an MDS embedding of GAP distances computed from the $RF_{uni}$ outlier detection model. C) Cumulative tally of which $RF_{uni}$ partitions are crossed along the trajectory.
  • Figure 5: Explainable anomaly detection with the MNIST dataset. A) The trajectory from an anomaly (digit 4, red) to inliers (digit 9, blue) in the MNIST dataset, visualized in a t-SNE embedding of the input data. B) The same trajectory visualized in an MDS embedding of GAP distances computed from the $RF_{uni}$ outlier detection model. C) Outlier scores of each data point along the trajectory from outlier to inliers. The corresponding image for each data point is shown, and its ground-truth anomaly label indicated by marker color. D) Visualization of feature importance of each pixel for the $RF_{uni}$ model, as a tally of how many partitions target each pixel. E) Counterfactual explanation of an outlier, showing the number of $RF_{uni}$ model partitions crossed (due to either increases or decreases in pixel values) to move directly from the first image to the last image in the trajectory. Blue indicates pixels where decreases in value explain reductions in outlier score, while red indicates where increases in the pixel value explain reductions in outlier score. F) Visualization of total $RF_{uni}$ partitions crossed, integrated over the counterfactual trajectory. Same color coding as in (E).
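
The partition tallies described in the captions for Figures 4 and 5 can be illustrated with a rough sketch. It is an approximation under stated assumptions, not the paper's procedure. The function name `partitions_crossed` is hypothetical, and the forest is assumed to be a fitted scikit-learn ensemble exposing `estimators_`. For a direct move from one point to another, the sketch counts every split in the forest whose threshold lies between the two points' values on the split feature, whether or not the straight path actually passes through the cell that the split divides. Positive entries mark features where increases in value cross partitions, and negative entries mark decreases, mirroring the red and blue coding in Figure 5E.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def partitions_crossed(forest, x_from, x_to):
    """Hypothetical sketch: signed per-feature tally of split thresholds between two points."""
    x_from = np.asarray(x_from, dtype=float)
    x_to = np.asarray(x_to, dtype=float)
    tally = np.zeros(x_from.shape[0])

    for est in forest.estimators_:
        tree = est.tree_
        # Leaf nodes carry a negative feature index, so keep internal nodes only.
        internal = tree.feature >= 0
        feat = tree.feature[internal]
        thr = tree.threshold[internal]

        a, b = x_from[feat], x_to[feat]
        # scikit-learn sends a sample left when x[feature] <= threshold.
        up = (a <= thr) & (b > thr)    # crossed by increasing the feature
        down = (a > thr) & (b <= thr)  # crossed by decreasing the feature

        # Accumulate +1 per upward crossing and -1 per downward crossing.
        np.add.at(tally, feat, up.astype(float) - down.astype(float))

    return tally
```

For image data such as MNIST, the returned vector could be reshaped to the image dimensions to produce a pixel-level map in the spirit of Figure 5E. Summing such tallies over consecutive pairs of points along a trajectory would give an integrated view comparable to Figure 5F.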