Anomaly detection using data depth: multivariate case

Pavlo Mozharovskyi, Romain Valla

TL;DR

This paper investigates multivariate anomaly detection by ranking observations with data depth, describing a framework where anomalies are labeled based on a depth function $D(\boldsymbol{x}|\boldsymbol{X})$ and a threshold $t$. It reviews multiple depth notions (e.g., $D^{\text{Mah}}$, $D^{\text{hfsp}}$, $D^{\text{smpv(ai)}}$, $D^{\text{prj}}$, $D^{\text{prj(as)}}$, $D^{\text{smp}}$), analyzes their invariance, robustness, and computational properties, and discusses practical choices for depth and thresholding in simulated industrial settings. The paper compares depth-based detection with methods like LOF, OC-SVM, and isolation forest, highlighting robustness to contaminated training data, extrapolation to unseen anomalies, and the ability to explain anomalies via interpretable directions, demonstrated on simulated data and 40 real data sets. It concludes that data depth provides a nonparametric, affine-invariant, interpretable anomaly-detection framework that remains feasible with approximate computations in moderate to high dimensions, and it offers open-source code for practitioners.
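To make the depth-plus-threshold rule concrete, here is a minimal Python sketch using the Mahalanobis depth in its standard moment-based form, $D^{\text{Mah}}(\boldsymbol{x}|\boldsymbol{X}) = 1/(1 + (\boldsymbol{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}-\boldsymbol{\mu}))$. It is not the authors' code: the function names, the simulated data, and the choice of $t$ as the 5% empirical quantile of training depths are assumptions made for illustration only.

```python
import numpy as np

def mahalanobis_depth(points, sample):
    """Mahalanobis depth of each row of `points` w.r.t. `sample`,
    using the (non-robust) sample mean and covariance."""
    mu = sample.mean(axis=0)
    cov = np.cov(sample, rowvar=False)
    diff = points - mu
    # Squared Mahalanobis distance, computed without an explicit inverse.
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return 1.0 / (1.0 + d2)

def label_anomalies(points, sample, t):
    """Flag a point as anomalous when its depth falls below the threshold t."""
    return mahalanobis_depth(points, sample) < t

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))  # simulated training sample (assumption)

# Hypothetical threshold choice: 5% empirical quantile of training depths.
t = np.quantile(mahalanobis_depth(X, X), 0.05)

new_points = np.array([[0.0, 0.0], [6.0, -6.0]])
flags = label_anomalies(new_points, X, t)
print(flags)
```

The point near the center of the cloud receives a depth close to 1 and is kept, while the distant point falls below the threshold and is flagged. Because the sample mean and covariance are not robust, this particular depth can be distorted by contaminated training data.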

Abstract

Anomaly detection is a branch of data analysis and machine learning that aims to identify observations that exhibit abnormal behaviour. Whether the anomalies are measurement errors, disease development, severe weather, production quality defects or failed equipment, financial fraud or crisis events, their timely identification, isolation and explanation is an important task in almost any branch of science and industry. By providing a robust ordering, data depth, a statistical function that measures how well any point of the space belongs to a data set, becomes a particularly useful tool for detecting anomalies. Already known for its theoretical properties, data depth has undergone substantial computational developments in the last decade, and particularly in recent years, which have made it applicable to contemporary-sized problems of data analysis and machine learning. In this article, data depth is studied as an efficient anomaly detection tool in a multivariate setting, assigning abnormality labels to observations with lower depth values. Practical questions are discussed: the necessity and reasonableness of invariances, the shape of the depth function, its robustness and computational complexity, and the choice of the threshold. Illustrations include use cases that underline the advantageous behaviour of data depth in various settings.
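As a rough illustration of robustness and approximate computation, the sketch below estimates projection depth by maximizing outlyingness over a finite set of random unit directions. It assumes the standard median/MAD form $D^{\text{prj}}(\boldsymbol{x}|\boldsymbol{X}) = 1/(1 + \sup_{\boldsymbol{u}} |\boldsymbol{u}^\top\boldsymbol{x} - \text{med}(\boldsymbol{u}^\top\boldsymbol{X})| / \text{MAD}(\boldsymbol{u}^\top\boldsymbol{X}))$; the function name, the number of directions, and the simulated contamination are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def projection_depth_approx(points, sample, n_dirs=1000, seed=0):
    """Approximate projection depth using random unit directions.

    The supremum over all directions is replaced by a maximum over
    `n_dirs` random ones, so the returned value is an upper bound
    on the exact projection depth."""
    rng = np.random.default_rng(seed)
    dim = sample.shape[1]
    directions = rng.standard_normal((n_dirs, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    proj_sample = sample @ directions.T   # shape (n, n_dirs)
    med = np.median(proj_sample, axis=0)  # robust location per direction
    mad = np.median(np.abs(proj_sample - med), axis=0)  # robust scale

    proj_points = points @ directions.T   # shape (m, n_dirs)
    outlyingness = np.max(np.abs(proj_points - med) / mad, axis=1)
    return 1.0 / (1.0 + outlyingness)

rng = np.random.default_rng(1)
normal_part = rng.standard_normal((90, 2))
# Hypothetical contamination: 10 tightly clustered anomalies near (8, 8).
anomalies = np.array([8.0, 8.0]) + 0.1 * rng.standard_normal((10, 2))
X = np.vstack([normal_part, anomalies])

query = np.array([[0.0, 0.0], [8.0, 8.0]])
depths = projection_depth_approx(query, X)
print(depths)
```

Because the median and MAD are barely affected by the 10 contaminating points, the anomalous location still receives a low depth even though anomalies are present in the training sample.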

Paper Structure (18 sections, 16 equations, 18 figures, 2 tables)

Figures (18)

  • Figure 1: Four normal observations (green dots) and four anomalies (red pluses and cross); contours of Mahalanobis (black solid) and projection (blue dashed) depths. Left: A sample of $500$ bivariate Gaussian observations (black pixels). Right: $17$ bivariate Gaussian observations (gray dots).
  • Figure 2: $90$ observations drawn from a bivariate normal distribution (blue dots) contaminated with $10$ observations (red pluses). Depth contours at three levels: minimal depth of normal observations (blue dotted line), maximal depth of the $10$ anomalies contaminating the training sample (red dash-dotted line), minimal depth of the $10$ anomalies contaminating the training sample (red dashed line). Left: depth values (in color) for projection depth. Right: depth values (in color) for the halfspace depth (with white corresponding to zero). The color scale for both plots is depicted on the right side.
  • Figure 3: Ordered depth values for projection and halfspace depth, with normal data corresponding to blue points and anomalies depicted with red pluses and crosses. Left: For $100$ observations of the training data from the first example in Section \ref{sec:suitability}. Right: For $125$ observations of the training data from the second example in Section \ref{sec:suitability}.
  • Figure 4: $90$ observations drawn from a bivariate normal distribution (blue dots) contaminated with $10$ observations (red pluses). Depth contours at three levels: minimal depth of normal observations (blue dotted line), maximal depth of the $10$ anomalies contaminating the training sample (red dash-dotted line), minimal depth of the $10$ anomalies contaminating the training sample (red dashed line). Left: depth values (in color) for simplicial volume depth. Right: depth values (in color) for simplicial depth (with white corresponding to zero). The color scale for both plots is depicted on the right side.
  • Figure 5: $90$ observations drawn from a bivariate normal distribution (blue dots) contaminated with $10$ clustered (red pluses) and $25$ masking (red crosses) anomalies. Depth contours at three levels: minimal depth of normal observations (blue dotted line), maximal depth of all $35$ anomalies contaminating the training sample (red dash-dotted line), minimal depth of all $35$ anomalies contaminating the training sample (red dashed line). Left: depth values (in color) for simplicial volume depth. Right: depth values (in color) for simplicial depth (with white corresponding to zero). The color scale for both plots is depicted on the right side.
  • ...and 13 more figures