Statistical Inference for Clustering-based Anomaly Detection

Nguyen Thi Minh Phu; Duong Tan Loc; Vo Nguyen Le Duy

TL;DR

SI-CLAD addresses the unreliability of clustering-based anomaly detection by applying Selective Inference (SI) to test the anomalies detected by DBSCAN, thereby controlling the false positive rate (FPR) at a pre-specified level $α$. The method conditions on the clustering outcome and a nuisance component to produce a valid selective p-value. This reduces the inference to a one-dimensional truncation problem along a data line $X(z)=a+bz$, where the truncation region $\mathcal{Z}$ is computed via over-conditioning and parametric programming. The method extends to multi-dimensional data through a Kronecker-structured statistic, enabling valid inference in higher dimensions. Empirical results on synthetic and real datasets show a controlled FPR and improved true detection rates, illustrating the practical value of SI-CLAD for reliable, unsupervised anomaly detection.
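The reduction to a one-dimensional truncation problem can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a generic contrast vector `eta` for the test statistic $z = \eta^\top X$, an isotropic covariance $\sigma^2 I$, and a truncation region $\mathcal{Z}$ that is already available as a list of intervals (finding those intervals is the part handled by over-conditioning and parametric programming). The function names `line_decomposition` and `selective_p_value` are hypothetical.

```python
import math

def normal_cdf(x, sigma=1.0):
    # CDF of N(0, sigma^2); erfc handles +/- infinity without special cases.
    return 0.5 * math.erfc(-x / (sigma * math.sqrt(2.0)))

def line_decomposition(x, eta):
    """Split x into a + b * z_obs with z_obs = eta^T x.

    Assumes Cov(X) = sigma^2 * I, so b = eta / ||eta||^2 and the
    nuisance component a satisfies eta^T a = 0.
    """
    eta_norm2 = sum(e * e for e in eta)
    z_obs = sum(e * xi for e, xi in zip(eta, x))
    b = [e / eta_norm2 for e in eta]
    a = [xi - bi * z_obs for xi, bi in zip(x, b)]
    return a, b, z_obs

def selective_p_value(z_obs, intervals, sigma):
    """Two-sided p-value of z_obs under N(0, sigma^2) truncated to `intervals`.

    `intervals` is a list of disjoint (lo, hi) pairs describing the
    truncation region; it is taken as given here.
    """
    def mass(lo, hi):
        return normal_cdf(hi, sigma) - normal_cdf(lo, sigma)

    t = abs(z_obs)
    total = 0.0  # probability of the whole truncation region
    tail = 0.0   # probability of the part with |z| >= |z_obs|
    for lo, hi in intervals:
        total += mass(lo, hi)
        if lo < -t:
            tail += mass(lo, min(hi, -t))
        if hi > t:
            tail += mass(max(lo, t), hi)
    if total <= 0.0:
        raise ValueError("truncation region has zero probability mass")
    return tail / total

# Same observed statistic, with and without conditioning on the selection.
inf = float("inf")
p_naive = selective_p_value(1.96, [(-inf, inf)], sigma=1.0)
p_selective = selective_p_value(1.96, [(1.5, inf)], sigma=1.0)
```

With no truncation the function reduces to the usual two-sided z-test (about 0.05 for $z = 1.96$), while conditioning on the hypothetical region $[1.5, \infty)$ raises the p-value to roughly 0.37, which is the mechanism that prevents selected anomalies from looking spuriously significant. The plain difference of CDFs used here loses precision far in the tails, so a production version would need a more careful tail computation.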

Abstract

Unsupervised anomaly detection (AD) is a fundamental problem in machine learning and statistics. A popular approach to unsupervised AD is clustering-based detection. However, this approach cannot guarantee the reliability of the anomalies it detects. In this paper, we propose SI-CLAD (Statistical Inference for CLustering-based Anomaly Detection), a novel statistical framework for testing the results of clustering-based AD. The key strength of SI-CLAD is that it rigorously controls the probability of falsely identifying anomalies, keeping it below a pre-specified significance level $α$ (e.g., $α= 0.05$). By analyzing the selection mechanism inherent in clustering-based AD and leveraging the Selective Inference (SI) framework, we prove that false detection control is attainable. Moreover, we introduce a strategy to boost the true detection rate, enhancing the overall performance of SI-CLAD. Extensive experiments on synthetic and real-world datasets provide strong empirical support for our theoretical findings, showcasing the superior performance of the proposed method.
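To make the selection step concrete, the following is a minimal sketch of clustering-based detection with DBSCAN, the clusterer named in the TL;DR: points that end up in no cluster (noise points) are reported as anomalies. This is a plain textbook DBSCAN written for illustration, not the authors' code; the function name `dbscan_anomalies` and the parameter values in the example are assumptions.

```python
import math

def dbscan_anomalies(points, eps, min_pts):
    """Return the indices of points that DBSCAN leaves unclustered (noise).

    A point is a core point if at least `min_pts` points (itself included)
    lie within distance `eps` of it.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [None] * n  # None means "not assigned to any cluster"
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue
        # Grow a new cluster outward from the unassigned core point i.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] is None:
                    labels[q] = cluster_id
                    if is_core[q]:
                        stack.append(q)  # only core points expand the cluster
        cluster_id += 1

    # Whatever is still unassigned is noise, i.e. a detected anomaly.
    return [i for i in range(n) if labels[i] is None]

# Four points in a tight square plus one far-away point.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
detected = dbscan_anomalies(data, eps=1.5, min_pts=3)
```

Here `detected` contains only index 4, the isolated point. Because the anomaly set is chosen by looking at the data, a test applied afterwards to the same data has to account for that selection, which is the problem the SI framework addresses.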
