Nearest-Neighbour-Induced Isolation Similarity and its Impact on Density-Based Clustering
Xiaoyu Qin, Kai Ming Ting, Ye Zhu, Vincent CS Lee
TL;DR
This work addresses clustering under varied densities by adopting a data-dependent similarity, the Isolation Kernel, and replacing its tree-based partitions with nearest-neighbour-induced Voronoi partitions. It provides a formal proof of the characteristic of Isolation Similarity, and introduces mass-connected clusters with MBSCAN, which replaces the fixed-radius density notion with a data-driven dissimilarity. Empirically, MBSCAN with the nearest-neighbour-induced Isolation Similarity (aNNE) outperforms DP and commonly used DBSCAN variants on most datasets, especially in high dimensions, while offering favorable computation via $O(\psi)$ operations for the similarity and GPU acceleration. Overall, the paper demonstrates both theoretical and practical benefits of a mass-based, density-adaptive approach to clustering, extending the utility of the Isolation Kernel beyond classification to density-based clustering.
Abstract
A recent proposal of a data-dependent similarity called Isolation Kernel/Similarity has enabled SVM to produce better classification accuracy. We identify shortcomings of using a tree method to implement Isolation Similarity, and propose a nearest-neighbour method instead. We formally prove the characteristic of Isolation Similarity with the use of the proposed method. The impact of Isolation Similarity on density-based clustering is studied here. We show for the first time that the clustering performance of the classic density-based clustering algorithm DBSCAN can be significantly uplifted to surpass that of the recent density-peak clustering algorithm DP. This is achieved by simply replacing the distance measure with the proposed nearest-neighbour-induced Isolation Similarity in DBSCAN, leaving the rest of the procedure unchanged. A new type of cluster, called a mass-connected cluster, is formally defined. We show that DBSCAN, which detects density-connected clusters, becomes an algorithm that detects mass-connected clusters when the distance measure is replaced with the proposed similarity. We also provide the condition under which mass-connected clusters can be detected while density-connected clusters cannot.
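To make the idea of a nearest-neighbour-induced similarity concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes that each partitioning is built by sampling $\psi$ points as centres, that every point falls into the Voronoi cell of its nearest centre, and that the similarity of two points is the fraction of partitionings in which they share a cell. The function names, the ensemble size `t`, and the `seed` parameter are illustrative assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def isolation_similarity(points, psi=4, t=100, seed=0):
    """Estimate a nearest-neighbour-induced isolation similarity matrix.

    Assumption (hypothetical sketch): for each of t partitionings, psi
    points are sampled as centres, each point is assigned to the Voronoi
    cell of its nearest centre, and similarity is the fraction of
    partitionings in which two points land in the same cell.
    """
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(t):
        # Sample psi centres without replacement from the data.
        centres = rng.sample(points, psi)
        # Index of the nearest centre for every point (its Voronoi cell).
        cell = [
            min(range(psi), key=lambda j: sq_dist(p, centres[j]))
            for p in points
        ]
        # Count co-occurrence of each pair of points in the same cell.
        for i in range(n):
            for k in range(n):
                if cell[i] == cell[k]:
                    counts[i][k] += 1
    return [[c / t for c in row] for row in counts]


def isolation_dissimilarity(points, psi=4, t=100, seed=0):
    """Dissimilarity defined here as 1 - similarity (an assumed convention)."""
    sim = isolation_similarity(points, psi=psi, t=t, seed=seed)
    return [[1.0 - s for s in row] for row in sim]
```

Under these assumptions, the resulting dissimilarity matrix could be passed to any DBSCAN implementation that accepts precomputed pairwise dissimilarities, leaving the rest of the clustering procedure unchanged, as the abstract describes. The pairwise loop is quadratic in the number of points and is meant only for small illustrative inputs.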
