Block-Diagonal Guided DBSCAN Clustering
Weibing Zhao
TL;DR
DBSCAN often struggles with high-dimensional, large-scale data and is sensitive to the neighborhood parameter $\epsilon$ and density threshold $\delta$. This work introduces BD-DBSCAN, which builds a similarity graph that can be permuted into block-diagonal form and then recovers the clustering structure by identifying the diagonal blocks. The pipeline combines a gradient-descent-based permutation routine, a DBSCAN-based point traversal that yields an augmented cluster ordering, and a split-and-refine diagonal-block search with theoretical guarantees. The method offers robustness to density variation, scalability to large datasets, and intuitive visualization of the clustering process. Empirical evaluation on twelve real-world benchmarks shows consistent superiority over state-of-the-art methods.
Abstract
Cluster analysis plays a crucial role in database mining, and DBSCAN is one of the most widely used algorithms in this field. However, DBSCAN has several limitations: it handles high-dimensional, large-scale data poorly, it is sensitive to its input parameters, and its clustering results lack robustness. This paper introduces an improved version of DBSCAN that uses the block-diagonal property of the similarity graph to guide the clustering procedure. The key idea is to construct a graph that measures the similarity between high-dimensional, large-scale data points and that can be transformed into block-diagonal form by an unknown permutation, and then to apply a cluster-ordering procedure that generates the desired permutation. The clustering structure can then be determined by identifying the diagonal blocks in the permuted graph. We propose a gradient-descent-based method to solve the resulting problem. Additionally, we develop a DBSCAN-based point traversal algorithm that identifies high-density clusters in the graph and generates an augmented ordering of clusters. Permuting the graph according to the traversal order yields its block-diagonal structure, which provides a flexible foundation for both automatic and interactive cluster analysis. We introduce a split-and-refine algorithm that automatically searches for all diagonal blocks in the permuted graph, with theoretical optimality guarantees in specific cases. We extensively evaluate the proposed approach on twelve challenging real-world benchmark clustering datasets and demonstrate superior performance compared to the state-of-the-art clustering method on every dataset.
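
The block-diagonal idea can be illustrated on toy data. The following is a minimal sketch, not the paper's algorithm: the Gaussian-kernel graph, the similarity threshold `eps`, the depth-first neighbor expansion, and the cut rule in `diagonal_blocks` are all assumptions standing in for the gradient-descent graph construction, the DBSCAN-based traversal, and the split-and-refine search, respectively. All function names are hypothetical.

```python
import numpy as np

def similarity_graph(X, sigma=1.0):
    """Gaussian-kernel similarity matrix (an assumed, generic choice)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def traversal_order(W, eps=0.5):
    """Order points by expanding through neighbors with similarity >= eps,
    in the spirit of DBSCAN's cluster expansion (simplified stand-in)."""
    n = W.shape[0]
    visited = np.zeros(n, dtype=bool)
    order = []
    for seed in range(n):
        if visited[seed]:
            continue
        visited[seed] = True
        stack = [seed]
        while stack:
            i = stack.pop()
            order.append(i)
            for j in np.flatnonzero((W[i] >= eps) & ~visited):
                visited[j] = True
                stack.append(int(j))
    return np.array(order)

def diagonal_blocks(Wp, eps=0.5):
    """Cut the permuted matrix at every position k where no entry linking
    indices < k to indices >= k reaches eps (assumed cut rule)."""
    n = Wp.shape[0]
    cuts = [0] + [k for k in range(1, n) if not np.any(Wp[:k, k:] >= eps)] + [n]
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]

# Toy data: three tight, well-separated groups of four points each,
# presented in a scrambled order so the raw similarity matrix is not block-diagonal.
offsets = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]])
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
perm = np.array([7, 0, 10, 3, 5, 8, 1, 11, 4, 2, 9, 6])
X = np.vstack([c + offsets for c in centers])[perm]
true_labels = np.repeat([0, 1, 2], 4)[perm]

W = similarity_graph(X)
order = traversal_order(W)
Wp = W[np.ix_(order, order)]      # permuted graph, block-diagonal after thresholding
blocks = diagonal_blocks(Wp)
print(blocks)                     # [(0, 4), (4, 8), (8, 12)]
```

Each `(start, end)` pair marks one diagonal block of the permuted matrix, and the points at positions `order[start:end]` form one cluster. Plotting `Wp` as a heat map gives the kind of visual check of the cluster structure that the abstract refers to.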
