Statistical Inference for Clustering-based Anomaly Detection

Nguyen Thi Minh Phu, Duong Tan Loc, Vo Nguyen Le Duy

TL;DR

SI-CLAD addresses the unreliability of clustering-based anomaly detection by applying Selective Inference to test anomalies detected by DBSCAN, thereby guaranteeing false-positive control at a pre-specified level $\alpha$. The method conditions on the clustering outcome and a nuisance component to produce a valid selective p-value, reducing the inference to a one-dimensional truncation problem along a data line $X(z)=a+bz$ and computing the truncation region $\mathcal{Z}$ via over-conditioning and parametric programming. It extends naturally to multi-dimensional data using a Kronecker-structured statistic, enabling valid inference in higher dimensions. Empirical results on synthetic and real datasets show controlled FPR and improved true detection rates, illustrating the practical impact of SI-CLAD for reliable, unsupervised anomaly detection.
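
The reduction to a one-dimensional problem along $X(z)=a+bz$ can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions not stated in this summary: univariate data with covariance $\sigma^2 I$, a test direction `eta` that contrasts the tested point with the mean of the remaining points, a truncation region supplied as a list of intervals (here hypothetical values), and a two-sided test. The function names are illustrative.

```python
import math


def norm_cdf(x, sd):
    """CDF of N(0, sd^2) at x; math.erf accepts +/- infinity."""
    return 0.5 * (1.0 + math.erf(x / (sd * math.sqrt(2.0))))


def line_through_data(x, eta):
    """Return (a, b, z_obs) such that x == a + b * z_obs.

    Assumes covariance sigma^2 * I, in which case sigma^2 cancels in b.
    """
    eta_norm2 = sum(e * e for e in eta)
    z_obs = sum(e * xi for e, xi in zip(eta, x))  # observed test statistic
    b = [e / eta_norm2 for e in eta]
    a = [xi - bi * z_obs for xi, bi in zip(x, b)]  # nuisance component
    return a, b, z_obs


def selective_p_value(z_obs, intervals, sd):
    """Two-sided p-value of z_obs under N(0, sd^2) truncated to the intervals.

    Plain CDF differences lose precision far in the tails; this is a sketch.
    """
    total = 0.0  # probability mass of the whole truncation region
    below = 0.0  # mass of the region lying at or below z_obs
    for lo, hi in intervals:
        total += norm_cdf(hi, sd) - norm_cdf(lo, sd)
        if z_obs >= hi:
            below += norm_cdf(hi, sd) - norm_cdf(lo, sd)
        elif z_obs > lo:
            below += norm_cdf(z_obs, sd) - norm_cdf(lo, sd)
    cdf = below / total
    return 2.0 * min(cdf, 1.0 - cdf)


# Toy data: the last point is the detected anomaly being tested.
x = [0.1, -0.3, 0.2, 0.0, 2.5]
eta = [-0.25, -0.25, -0.25, -0.25, 1.0]  # anomaly minus mean of the others
sigma2 = 1.0                              # assumed known noise variance

a, b, z_obs = line_through_data(x, eta)
sd = math.sqrt(sigma2 * sum(e * e for e in eta))

# Naive test: ignores the selection, so the region is the whole real line.
p_naive = selective_p_value(z_obs, [(-math.inf, math.inf)], sd)

# Selective test: hypothetical truncation region, assumed given here.
region = [(-math.inf, -0.9), (0.8, math.inf)]
p_selective = selective_p_value(z_obs, region, sd)

print(p_naive, p_selective)
```

In this toy setting the selective p-value is larger than the naive one, because conditioning on the detection event removes the part of the null distribution in which the point would not have been flagged at all.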

Abstract

Unsupervised anomaly detection (AD) is a fundamental problem in machine learning and statistics. A popular approach to unsupervised AD is clustering-based detection. However, this method lacks the ability to guarantee the reliability of the detected anomalies. In this paper, we propose SI-CLAD (Statistical Inference for CLustering-based Anomaly Detection), a novel statistical framework for testing the clustering-based AD results. The key strength of SI-CLAD lies in its ability to rigorously control the probability of falsely identifying anomalies, maintaining it below a pre-specified significance level $\alpha$ (e.g., $\alpha = 0.05$). By analyzing the selection mechanism inherent in clustering-based AD and leveraging the Selective Inference (SI) framework, we prove that false detection control is attainable. Moreover, we introduce a strategy to boost the true detection rate, enhancing the overall performance of SI-CLAD. Extensive experiments on synthetic and real-world datasets provide strong empirical support for our theoretical findings, showcasing the superior performance of the proposed method.

Paper Structure

This paper contains 17 sections, 3 theorems, 36 equations, 9 figures, 2 algorithms.

Key Result

Lemma 1

The selective $p$-value proposed in (eq:selective_p) satisfies the property of a valid $p$-value:
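
The statement above is cut off before its formula. For orientation only, the validity of a $p$-value is conventionally expressed as follows; the symbol $p^{\mathrm{selective}}$ and the choice of inequality rather than equality are assumptions here and are not copied from the paper:

```latex
\mathbb{P}_{H_0}\left( p^{\mathrm{selective}} \le \alpha \right) \le \alpha,
\quad \forall \alpha \in [0, 1]
```

When the $p$-value is exactly uniform on $[0, 1]$ under the null hypothesis, the inequality holds with equality, so rejecting whenever the $p$-value is at most $\alpha$ gives a false positive rate of $\alpha$.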

Figures (9)

  • Figure 1: Illustration of the proposed SI-CLAD method. Clustering-based anomaly detection (AD) produces falsely detected anomalies (C, D). The naive $p$-values are small even for these false detections. With the proposed SI-CLAD, we can distinguish false positive (FP) from true positive (TP) detections: the $p$-values are large for FPs and small for TPs.
  • Figure 2: A schematic illustration of the proposed method. Applying DBSCAN to the observed data yields a set of anomalies. We then parametrize the data with a scalar parameter $z$ along the direction of the test statistic and identify the truncation region $\mathcal{Z}$, i.e., the set of $z$ for which the parametrized data yield the same anomaly detection result as the observed data. Finally, valid inference is conducted conditional on $\mathcal{Z}$. A divide-and-conquer strategy with a hierarchical line search is used to characterize $\mathcal{Z}$ efficiently.
  • Figure 3: FPR and TPR in univariate case
  • Figure 4: FPR and TPR in multi-dimensional case
  • Figure 5: FPR and TPR in the case of correlated data
  • ...and 4 more figures
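
The truncation region $\mathcal{Z}$ described in the Figure 2 caption can be illustrated with a brute-force sketch. This is deliberately not the paper's approach: instead of parametric programming and a hierarchical line search, it re-runs a small pure-Python 1-D DBSCAN at the midpoint of each cell of a grid over $z$ and keeps the cells where the detected anomaly set matches the observed one. The result is only approximate and limited to a bounded window. The function names, the toy data, and the values of `eps` and `min_samples` are illustrative assumptions.

```python
def dbscan_noise(x, eps, min_samples):
    """Indices that 1-D DBSCAN would label as noise (the detected anomalies)."""
    n = len(x)
    neighbors = [[j for j in range(n) if abs(x[i] - x[j]) <= eps]
                 for i in range(n)]
    core = [len(nb) >= min_samples for nb in neighbors]  # a point counts itself
    # Noise: neither a core point nor within eps of any core point.
    return frozenset(i for i in range(n)
                     if not any(core[j] for j in neighbors[i]))


def truncation_region_grid(a, b, observed, eps, min_samples,
                           z_min, z_max, steps):
    """Approximate {z : anomalies(a + b*z) == observed} on [z_min, z_max]."""
    width = (z_max - z_min) / steps
    intervals = []
    prev_hit = False
    for k in range(steps):
        lo = z_min + k * width
        hi = lo + width
        mid = lo + 0.5 * width
        x_z = [ai + bi * mid for ai, bi in zip(a, b)]  # data on the line at z = mid
        hit = dbscan_noise(x_z, eps, min_samples) == observed
        if hit and prev_hit:
            intervals[-1][1] = hi          # extend the current interval
        elif hit:
            intervals.append([lo, hi])     # start a new interval
        prev_hit = hit
    return [tuple(iv) for iv in intervals]


# Toy data: four clustered points and one far point.
x = [0.1, -0.3, 0.2, 0.0, 2.5]
eps, min_samples = 0.6, 3
observed = dbscan_noise(x, eps, min_samples)  # anomalies on the observed data

# Test direction for point j: point j minus the mean of the other points
# (an assumed choice of test statistic).
j = 4
others = [i for i in range(len(x)) if i != j]
eta = [1.0 if i == j else -1.0 / len(others) for i in range(len(x))]
eta_norm2 = sum(e * e for e in eta)
z_obs = sum(e * xi for e, xi in zip(eta, x))
b = [e / eta_norm2 for e in eta]
a = [xi - bi * z_obs for xi, bi in zip(x, b)]

region = truncation_region_grid(a, b, observed, eps, min_samples,
                                -10.0, 10.0, 2000)
print(sorted(observed), region)
```

For this toy example the region consists of two intervals, one on each side of the cluster, and the observed statistic `z_obs` lies in the right-hand one. A grid scan costs one clustering run per cell and can miss narrow intervals, which is the kind of limitation an exact characterization of $\mathcal{Z}$ avoids.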

Theorems & Definitions (7)

  • Remark 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof