Scalable Density-based Clustering with Random Projections
Haochuan Xu, Ninh Pham
TL;DR
The paper addresses the scalability of density-based clustering in high dimensions. It introduces sDBSCAN, which uses neighborhood-preserving random projections to quickly identify core points and their neighborhoods, and sOPTICS, a scalable OPTICS variant for interactive exploration of the clustering structure. The authors show that, under mild conditions, sDBSCAN yields a clustering structure similar to DBSCAN's with high probability for cosine distance, and they extend both methods to $L_2$, $L_1$, $\chi^2$, and Jensen-Shannon distances via random kernel features. Experiments on real-world million-point datasets (e.g., Mnist, Mnist8m, Pamap2) show that sDBSCAN and sOPTICS run in a few minutes, whereas their scikit-learn counterparts take several hours or fail due to memory constraints, while clustering quality remains competitive. The approach is multi-thread friendly and builds on random-projection-based approximate nearest neighbor search (ANNS).
Abstract
We present sDBSCAN, a scalable density-based clustering algorithm in high dimensions with cosine distance. Utilizing the neighborhood-preserving property of random projections, sDBSCAN can quickly identify core points and their neighborhoods, the primary hurdle of density-based clustering. Theoretically, sDBSCAN outputs a clustering structure similar to DBSCAN under mild conditions with high probability. To make sDBSCAN easier to use, we present sOPTICS, a scalable OPTICS for interactive exploration of the intrinsic clustering structure. We also extend sDBSCAN and sOPTICS to $L_2$, $L_1$, $\chi^2$, and Jensen-Shannon distances via random kernel features. Empirically, sDBSCAN is significantly faster and provides higher accuracy than many other clustering algorithms on real-world million-point data sets. On these data sets, sDBSCAN and sOPTICS run in a few minutes, while their scikit-learn counterparts demand several hours or cannot run due to memory constraints.
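To make the idea of finding core points through random projections more concrete, the following is a minimal NumPy sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: the function name `approx_core_points` and the parameters `n_proj`, `k`, and `m` are hypothetical, and cosine distance is assumed to be one minus cosine similarity. Each point looks only at the `k` random vectors it projects onto most strongly, takes the `m` points ranked highest on those vectors as candidates, and verifies the candidates with an exact distance check.

```python
import numpy as np

def approx_core_points(X, eps, min_pts, n_proj=64, k=5, m=50, seed=0):
    """Approximate DBSCAN core-point detection under cosine distance.

    Illustrative sketch only: n_proj, k, and m are hypothetical parameter names.
    """
    rng = np.random.default_rng(seed)
    # Normalize rows so that a dot product equals cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    n, d = X.shape
    R = rng.standard_normal((d, n_proj))   # random Gaussian directions
    P = X @ R                              # projections, shape (n, n_proj)

    # For each point: indices of the k directions with the largest projection.
    top_vecs = np.argsort(-P, axis=1)[:, :k]
    # For each direction: indices of the m points with the largest projection.
    top_pts = np.argsort(-P, axis=0)[:m, :]

    is_core = np.zeros(n, dtype=bool)
    neighborhoods = []
    for i in range(n):
        # Candidates: points ranked highly on the directions closest to point i.
        cand = np.unique(top_pts[:, top_vecs[i]])
        cand = np.union1d(cand, [i])       # a point is in its own neighborhood
        # Exact check: keep candidates within cosine distance eps.
        dist = 1.0 - X[cand] @ X[i]
        nbrs = cand[dist <= eps]
        neighborhoods.append(nbrs)
        is_core[i] = len(nbrs) >= min_pts
    return is_core, neighborhoods
```

Because every candidate is verified exactly, the returned neighborhoods contain no false neighbors. They may miss true neighbors when `k` and `m` are small, which is the accuracy-versus-speed trade-off such a scheme exposes. Setting `k = n_proj` and `m = n` makes every point a candidate and recovers the exact neighborhoods at brute-force cost.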
