Nearest-Neighbour-Induced Isolation Similarity and its Impact on Density-Based Clustering

Xiaoyu Qin, Kai Ming Ting, Ye Zhu, Vincent CS Lee

TL;DR

This work addresses clustering under varied densities by adopting a data-dependent similarity, the Isolation Kernel, and replacing tree-based partitions with nearest-neighbour-induced Voronoi partitions. It provides a formal proof of the characteristic of Isolation Similarity, and introduces mass-connected clusters with MBSCAN, which replaces the fixed-radius density notion with a data-driven dissimilarity. Empirically, MBSCAN with the nearest-neighbour-induced Isolation Similarity (aNNE) outperforms DP and commonly used DBSCAN variants on most datasets, especially in high dimensions, while keeping computation favorable: the similarity takes $O(\psi)$ operations and supports GPU acceleration. Overall, the paper demonstrates both theoretical and practical benefits of a mass-based, density-adaptive approach to clustering, extending the utility of Isolation Kernel beyond classification to density-based clustering.

Abstract

A recent proposal of data-dependent similarity called Isolation Kernel/Similarity has enabled SVM to produce better classification accuracy. We identify shortcomings of using a tree method to implement Isolation Similarity, and propose a nearest-neighbour method instead. We formally prove the characteristic of Isolation Similarity with the use of the proposed method. The impact of Isolation Similarity on density-based clustering is studied here. We show for the first time that the clustering performance of the classic density-based clustering algorithm DBSCAN can be significantly uplifted to surpass that of the recent density-peak clustering algorithm DP. This is achieved by simply replacing the distance measure with the proposed nearest-neighbour-induced Isolation Similarity in DBSCAN, leaving the rest of the procedure unchanged. A new type of cluster called the mass-connected cluster is formally defined. We show that DBSCAN, which detects density-connected clusters, becomes an algorithm that detects mass-connected clusters when the distance measure is replaced with the proposed similarity. We also provide the condition under which mass-connected clusters can be detected while density-connected clusters cannot.
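The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the usual reading of Isolation Similarity: the similarity of two points is estimated as the fraction of $t$ random partitionings in which both fall in the same cell, where each partitioning is the Voronoi diagram of $\psi$ points sampled from the data. The function names (`cell_signatures`, `isolation_similarity`, `dissimilarity_matrix`, `dbscan`), the parameter names `eps` and `min_pts`, and the toy data are all hypothetical. The DBSCAN routine takes a precomputed dissimilarity matrix, so switching from Euclidean distance to one minus the similarity changes only the matrix passed in.

```python
import math
import random


def cell_signatures(data, psi, t, seed=0):
    """For every point, record its Voronoi cell index in each of t partitionings.

    Each partitioning is induced by psi points sampled from data (assumed setup).
    """
    rng = random.Random(seed)
    signatures = [[] for _ in data]
    for _ in range(t):
        centres = rng.sample(data, psi)
        for sig, point in zip(signatures, data):
            # Nearest sampled centre = Voronoi cell containing the point: O(psi).
            nearest = min(range(psi), key=lambda k: math.dist(point, centres[k]))
            sig.append(nearest)
    return signatures


def isolation_similarity(sig_x, sig_y):
    """Fraction of partitionings in which the two points share a cell."""
    return sum(a == b for a, b in zip(sig_x, sig_y)) / len(sig_x)


def dissimilarity_matrix(items, dissimilarity):
    """Full pairwise matrix; fine for a toy example, quadratic in len(items)."""
    return [[dissimilarity(a, b) for b in items] for a in items]


def dbscan(matrix, eps, min_pts):
    """Minimal DBSCAN on a precomputed dissimilarity matrix; -1 marks noise."""
    n = len(matrix)
    neighbours = [[j for j in range(n) if matrix[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                frontier.extend(neighbours[j])  # j is a core point: expand
    return labels


if __name__ == "__main__":
    rng = random.Random(42)
    # Toy data (hypothetical): one sparse and one dense uniform region.
    sparse = [(rng.uniform(0, 4), rng.uniform(0, 4)) for _ in range(20)]
    dense = [(rng.uniform(6, 7), rng.uniform(0, 1)) for _ in range(80)]
    data = sparse + dense

    # Classic DBSCAN: Euclidean distance.
    by_distance = dbscan(dissimilarity_matrix(data, math.dist), eps=0.3, min_pts=5)

    # Same procedure, only the dissimilarity is swapped for 1 - similarity.
    sigs = cell_signatures(data, psi=8, t=100)
    iso = dissimilarity_matrix(sigs, lambda a, b: 1 - isolation_similarity(a, b))
    by_isolation = dbscan(iso, eps=0.5, min_pts=5)

    print("clusters (distance): ", len(set(by_distance) - {-1}))
    print("clusters (isolation):", len(set(by_isolation) - {-1}))
```

Precomputing the cell signatures keeps each pairwise similarity to a comparison of two length-$t$ lists. The thresholds in the demo are arbitrary illustrative values, not settings from the paper.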

Paper Structure

This paper contains 13 sections, 1 theorem, 14 equations, 6 figures, 3 tables.

Key Result

Lemma 1

$\forall x, y \in \mathcal{X}_\mathsf{S}$ (sparse region) and $\forall x',y' \in \mathcal{X}_\mathsf{T}$ (dense region) such that $\forall_{z\in \mathcal{X}_\mathsf{S}, z'\in \mathcal{X}_\mathsf{T}} \ \rho(z)<\rho(z')$, the nearest neighbour-induced Isolation Similarity $K_\psi$ has the characterist

Figures (6)

  • Figure 1: Examples of two isolation partitioning mechanisms: Axis-parallel versus nearest neighbour (NN). On a dataset having two (uniform) densities, i.e., the right half has a higher density than the left half.
  • Figure 2: (a) Reference points used in the simulations, where the inter-point distance $\| x-y \| = \| x'-y' \|$ increases. Simulation results (b) as $\psi$ increases; and (c) as the inter-point distance increases. $t=10000$ is used.
  • Figure 3: Contour plots of $\mathfrak p_\imath$ on the Thyroid dataset (mapped to 2 dimensions using MDS [borg2012applied]). $\psi=14$ is used in aNNE and iForest.
  • Figure 4: (a) A hard distribution for DBSCAN as estimated by $N_\epsilon$, where DBSCAN (which uses $N_\epsilon$) fails to detect all clusters using a threshold. (b) The distribution estimated by $M_\alpha$ from the same dataset, where MBSCAN (which uses $M_\alpha$) succeeds in detecting all clusters using a threshold.
  • Figure 5: Change of neighbourhood density/mass with respect to its parameter. $N_\epsilon$ uses $\ell_2$; and $M_\alpha$ uses $\mathfrak p_\imath$-aNNE. The peak numbers and valley numbers refer to those shown in Figure 4(a).
  • ...and 1 more figure

Theorems & Definitions (6)

  • Definition 1
  • Definition 2
  • Lemma 1
  • PROOF 1
  • Definition 3
  • Definition 4