Statistical Inference for Clustering-based Anomaly Detection
Nguyen Thi Minh Phu, Duong Tan Loc, Vo Nguyen Le Duy
TL;DR
SI-CLAD addresses the unreliability of clustering-based anomaly detection by applying Selective Inference (SI) to test anomalies detected by DBSCAN, thereby guaranteeing that the false positive rate (FPR) is controlled at a pre-specified level $\alpha$. The method conditions on the clustering outcome and a nuisance component to produce a valid selective p-value. This reduces the inference to a one-dimensional truncation problem along a data line $X(z)=a+bz$, where the truncation region $\mathcal{Z}$ is computed via over-conditioning and parametric programming. The method extends to multi-dimensional data using a Kronecker-structured statistic, which keeps the inference valid in higher dimensions. Empirical results on synthetic and real datasets show controlled FPR and improved true detection rates, supporting SI-CLAD as a reliable tool for unsupervised anomaly detection.
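To make the one-dimensional truncation step concrete, the following is a minimal sketch of how a selective p-value can be computed once the truncation region is known. It is not the authors' implementation. It assumes that, under the null hypothesis, the test statistic is a zero-mean Gaussian with known standard deviation `sigma`, and that the truncation region has already been found and is given as a list of disjoint intervals. The function names and the two-sided form of the p-value are illustrative choices.

```python
import math


def norm_cdf(x):
    """Standard normal CDF, written with erfc so that +/- infinity is handled."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def truncated_normal_cdf(z_obs, sigma, intervals):
    """CDF at z_obs of N(0, sigma^2) truncated to a union of disjoint intervals.

    `intervals` is a list of (lo, hi) pairs standing in for the truncation
    region (assumed to be precomputed; finding it is not shown here).
    """
    numer = 0.0  # truncated mass at or below z_obs
    denom = 0.0  # total mass of the truncation region
    for lo, hi in intervals:
        mass = norm_cdf(hi / sigma) - norm_cdf(lo / sigma)
        denom += mass
        if z_obs >= hi:
            numer += mass
        elif z_obs > lo:
            numer += norm_cdf(z_obs / sigma) - norm_cdf(lo / sigma)
    if denom <= 0.0:
        raise ValueError("truncation region has zero probability mass")
    return numer / denom


def selective_p_value(z_obs, sigma, intervals):
    """Two-sided selective p-value (illustrative form, assumed here)."""
    cdf = truncated_normal_cdf(z_obs, sigma, intervals)
    return 2.0 * min(cdf, 1.0 - cdf)


# With no truncation the result reduces to the usual (naive) two-sided p-value.
p_naive = selective_p_value(1.96, 1.0, [(-math.inf, math.inf)])

# Restricting to a hypothetical region [1, +inf) makes the same observation
# much less surprising, so the selective p-value is larger.
p_selective = selective_p_value(1.96, 1.0, [(1.0, math.inf)])
```

In this toy comparison the same observed statistic looks significant when the selection is ignored but not when the truncation is taken into account, which is the effect that conditioning on the clustering outcome is meant to capture.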
Abstract
Unsupervised anomaly detection (AD) is a fundamental problem in machine learning and statistics. A popular approach to unsupervised AD is clustering-based detection. However, this approach provides no guarantee on the reliability of the detected anomalies. In this paper, we propose SI-CLAD (Statistical Inference for CLustering-based Anomaly Detection), a novel statistical framework for testing the results of clustering-based AD. The key strength of SI-CLAD is that it rigorously controls the probability of falsely identifying anomalies, keeping it below a pre-specified significance level $\alpha$ (e.g., $\alpha = 0.05$). By analyzing the selection mechanism inherent in clustering-based AD and leveraging the Selective Inference (SI) framework, we prove that false detection control is attainable. Moreover, we introduce a strategy to boost the true detection rate, enhancing the overall performance of SI-CLAD. Extensive experiments on synthetic and real-world datasets provide strong empirical support for our theoretical findings and demonstrate the superior performance of the proposed method.
