Anomaly detection using data depth: multivariate case
Pavlo Mozharovskyi, Romain Valla
TL;DR
This paper investigates multivariate anomaly detection by ranking observations with data depth, describing a framework where anomalies are labeled based on a depth function $D(\boldsymbol{x}|\boldsymbol{X})$ and a threshold $t$. It reviews multiple depth notions (e.g., $D^{\text{Mah}}$, $D^{\text{hfsp}}$, $D^{\text{smpv(ai)}}$, $D^{\text{prj}}$, $D^{\text{prj(as)}}$, $D^{\text{smp}}$), analyzes their invariance, robustness, and computational properties, and discusses practical choices for depth and thresholding in simulated industrial settings. The paper compares depth-based detection with methods like LOF, OC-SVM, and isolation forest, highlighting robustness to contaminated training data, extrapolation to unseen anomalies, and the ability to explain anomalies via interpretable directions, demonstrated on simulated data and 40 real data sets. It concludes that data depth provides a nonparametric, affine-invariant, interpretable anomaly-detection framework that remains feasible with approximate computations in moderate to high dimensions, and it offers open-source code for practitioners.
Abstract
Anomaly detection is a branch of data analysis and machine learning that aims to identify observations exhibiting abnormal behaviour. Be it measurement errors, disease development, severe weather, production quality defects or failed equipment, financial fraud or crisis events, their timely identification, isolation and explanation constitute an important task in almost any branch of science and industry. By providing a robust ordering, data depth, a statistical function that measures the belongingness of any point of the space to a data set, becomes a particularly useful tool for the detection of anomalies. Already known for its theoretical properties, data depth has undergone substantial computational developments in the last decade, and particularly in recent years, which have made it applicable to contemporary-sized problems of data analysis and machine learning. In this article, data depth is studied as an efficient anomaly detection tool in a multivariate setting, assigning abnormality labels to observations with lower depth values. Practical questions are discussed: the necessity and reasonability of invariances, the shape of the depth function, its robustness and computational complexity, and the choice of the threshold. Illustrations include use cases that underline the advantageous behaviour of data depth in various settings.
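
To make the labeling rule concrete, the following is a minimal sketch of depth-based detection, not the authors' implementation. It assumes the standard Mahalanobis depth, $1/(1 + (\boldsymbol{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}-\boldsymbol{\mu}))$, computed with the classical (non-robust) sample mean and covariance, and it flags a point as anomalous when its depth falls below a threshold $t$. The function names, the simulated data, and the choice of $t$ as the 5% empirical quantile of training depths are illustrative assumptions only.

```python
import numpy as np


def mahalanobis_depth(points, data):
    """Mahalanobis depth of each row of `points` with respect to `data`.

    Assumption: moment estimates (sample mean and covariance) are used;
    a robust variant would replace them with robust estimates.
    """
    data = np.asarray(data, dtype=float)
    points = np.atleast_2d(np.asarray(points, dtype=float))
    mean = data.mean(axis=0)
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    diff = points - mean
    # Squared Mahalanobis distance of every point to the sample mean.
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return 1.0 / (1.0 + d2)


def label_anomalies(points, data, t):
    """Return True for points whose depth is below the threshold t."""
    return mahalanobis_depth(points, data) < t


# Simulated training sample (illustrative, not data from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))

# Hypothetical threshold choice: 5% empirical quantile of training depths.
t = np.quantile(mahalanobis_depth(X, X), 0.05)

new_points = np.array([[0.0, 0.0], [6.0, 6.0]])
depths = mahalanobis_depth(new_points, X)
flags = label_anomalies(new_points, X, t)
```

Here the point near the centre of the sample receives a depth close to 1 and is kept, while the distant point receives a depth close to 0 and is flagged. Other depth notions would replace only `mahalanobis_depth`; the thresholding step stays the same.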
