Unsupervised Parameter-free Outlier Detection using HDBSCAN* Outlier Profiles

Kushankur Ghosh, Murilo Coelho Naldi, Jörg Sander, Euijin Choo

TL;DR

An unsupervised strategy to find the "best" minpts value, leveraging the range of GLOSH scores across minpts values to identify the value for which GLOSH scores can best identify outliers from the rest of the dataset.

Abstract

In machine learning and data mining, outliers are data points that differ significantly from the rest of the dataset and often introduce irrelevant information that can bias its statistics and models. Unsupervised methods are therefore crucial for detecting outliers when there is limited or no information about them. Global-Local Outlier Scores based on Hierarchies (GLOSH) is an unsupervised outlier detection method within HDBSCAN*, a state-of-the-art hierarchical clustering method. GLOSH estimates an outlier score for each data point by comparing the point's density to the highest density of the region in which it resides in the HDBSCAN* hierarchy. GLOSH may be sensitive to HDBSCAN*'s minpts parameter, which influences density estimation. With limited knowledge about the data, choosing an appropriate minpts value beforehand is challenging, as some minpts values may represent the underlying cluster structure better than others. Additionally, when searching for "potential outliers", one has to specify the number of outliers n in a dataset, which may be impractical because n is often unknown. In this paper, we propose an unsupervised strategy to find the "best" minpts value, leveraging the range of GLOSH scores across minpts values to identify the value for which GLOSH scores can best separate outliers from the rest of the dataset. Moreover, we propose an unsupervised strategy to estimate a threshold for classifying points into inliers and (potential) outliers without the need to pre-define any value. Our experiments show that our strategies can automatically find the minpts value and threshold that yield the best or near-best outlier detection results using GLOSH.
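
For orientation, the density comparison behind GLOSH is usually written as follows in the original HDBSCAN* formulation. The symbols here are assumptions borrowed from that formulation and may differ from the notation used in the body of this paper:

$$
\mathrm{GLOSH}(x_i) \;=\; \frac{\lambda_{\max}(x_i) - \lambda(x_i)}{\lambda_{\max}(x_i)} \;=\; 1 - \frac{\varepsilon_{\max}(x_i)}{\varepsilon(x_i)}, \qquad \lambda = \frac{1}{\varepsilon}
$$

Here $\lambda(x_i)$ is the density level at which $x_i$ last belongs to a cluster $C$ of the hierarchy before becoming noise, and $\lambda_{\max}(x_i)$ is the highest density level at which $C$ or any of its sub-clusters still exists. A point that is about as dense as the densest part of its region scores near 0, and a point much sparser than that region scores near 1. Because both quantities depend on core distances, the score of a point can change with $min_{pts}$.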

Paper Structure

This paper contains 19 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Banana Dataset with different kinds of outliers.
  • Figure 2: Comparing GLOSH-Profiles ($P\Gamma$) with the Precision@n (P@n) obtained by GLOSH at different $min_{pts}$ values. The black line in each panel shows the P@n at every $min_{pts}$ value. Outlier profiles are shown in red and inlier profiles in gray. The green (dashed) line marks the $min_{pts}$ value at which GLOSH first achieves the best P@n within the range $[2, 100]$ of $min_{pts}$ values.
  • Figure 3: Comparing the ORD-Profiles ($R_{m_{max}}$) with GLOSH's P@n across different $min_{pts}$ values. The dashed line marks the $min_{pts}$ value at which GLOSH first records the best P@n within the range $[2, 100]$ of $min_{pts}$ values.
  • Figure 4: Illustrating the process of finding the elbow of the Outlier Rank Dissimilarity-Profile on the Banana Dataset with global outliers.
  • Figure 5: Banana Dataset: sorted sequence of GLOSH scores at $min_{pts} = m^{*}$. The x-axis shows the data points $x_i$ in ascending order of their GLOSH scores, and the y-axis shows the corresponding GLOSH scores $\Gamma_{m^*}(x_i)$. Green dots represent the GLOSH scores of inliers, and red dots those of outliers.
  • ...and 2 more figures
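
Several captions above refer to Precision@n (P@n) and to the $min_{pts}$ value at which the best P@n is first reached. The sketch below shows one common way to compute both. It is illustrative only: the function names, the toy scores, and the tie-handling rule are assumptions, not the authors' evaluation code, and the scores are hand-written numbers rather than real GLOSH output.

```python
from typing import Dict, Sequence


def precision_at_n(scores: Sequence[float], is_outlier: Sequence[bool]) -> float:
    """Fraction of true outliers among the n highest-scoring points,
    where n is the number of labeled outliers (a common P@n convention)."""
    if len(scores) != len(is_outlier):
        raise ValueError("scores and labels must have the same length")
    n = sum(is_outlier)
    if n == 0:
        raise ValueError("P@n is undefined without labeled outliers")
    # Rank indices by descending score. Python's sort is stable, so tied
    # scores keep their input order (an assumption of this sketch).
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = sum(1 for i in ranked[:n] if is_outlier[i])
    return hits / n


def first_best_minpts(p_at_n: Dict[int, float]) -> int:
    """Smallest minpts value that attains the maximum P@n."""
    best = max(p_at_n.values())
    return min(m for m, p in p_at_n.items() if p == best)


# Toy data: five points, the last two are labeled outliers.
labels = [False, False, False, True, True]
# Hypothetical outlier scores per minpts value (stand-ins for GLOSH scores).
scores_by_minpts = {
    2: [0.1, 0.9, 0.2, 0.8, 0.3],
    3: [0.1, 0.2, 0.3, 0.9, 0.8],
    4: [0.2, 0.1, 0.3, 0.7, 0.9],
}

p_at_n = {m: precision_at_n(s, labels) for m, s in scores_by_minpts.items()}
print(p_at_n)                     # {2: 0.5, 3: 1.0, 4: 1.0}
print(first_best_minpts(p_at_n))  # 3
```

P@n needs ground-truth labels, so it can serve only as an evaluation reference, like the dashed lines in the figures. It cannot be used as the unsupervised selection criterion itself.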