Scalable Density-based Clustering with Random Projections

Haochuan Xu, Ninh Pham

TL;DR

The paper tackles the scalability of density-based clustering in high dimensions. It introduces sDBSCAN, which uses neighborhood-preserving random projections to quickly identify core points and their neighborhoods, and sOPTICS, a scalable OPTICS for interactive exploration of the clustering structure. For cosine distance, the authors show that sDBSCAN outputs a clustering structure similar to that of DBSCAN under mild conditions with high probability, and they extend both methods to $L_2$, $L_1$, $\chi^2$, and Jensen-Shannon distances via random kernel features. Experiments on million-point data sets (e.g., Mnist, Mnist8m, Pamap2) show sDBSCAN and sOPTICS running orders of magnitude faster than their scikit-learn equivalents while maintaining competitive clustering quality. The approach is multi-thread friendly and builds on random-projection-based approximate nearest neighbor search (ANNS), which is what lets it scale across these distance measures.
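
The neighborhood-preserving idea can be illustrated with a small NumPy sketch. It is not the authors' sDBSCAN implementation: the function name `candidate_neighbors`, the use of a single random vector, the sign adjustment, and all parameter values are assumptions made for illustration. For a unit-norm query `q`, it picks the random vector with the largest absolute projection of `q`, then ranks all points by their sign-adjusted projection onto that vector. Points ranked highest tend to have a larger cosine similarity to `q` than an average point.

```python
import numpy as np

def candidate_neighbors(X, q, R, k):
    """Return indices of k candidate neighbors of q (hypothetical helper).

    X: (n, d) unit-norm points, q: (d,) unit-norm query,
    R: (D, d) Gaussian random vectors, k: number of candidates to keep.
    """
    proj_q = R @ q                       # projections of q onto all D random vectors
    i = np.argmax(np.abs(proj_q))        # random vector with the largest |q . r_i|
    sign = np.sign(proj_q[i])            # assumption: flip so that "large" means "close to q"
    scores = sign * (X @ R[i])           # sign-adjusted projections of every point
    return np.argsort(-scores)[:k]       # k points with the largest scores

rng = np.random.default_rng(0)
n, d, D, k = 2000, 64, 1024, 50          # illustrative sizes, not taken from the paper

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # place points on the unit sphere
q = X[0]
R = rng.standard_normal((D, d))

cand = candidate_neighbors(X, q, R, k)
cos_all = X @ q                          # cosine similarity of every point to q
mean_cand = cos_all[cand].mean()
mean_all = cos_all.mean()
print(f"mean cosine, candidates: {mean_cand:.3f}  all points: {mean_all:.3f}")
```

In a density-based setting, such a candidate list could replace an exact range query when deciding whether `q` has enough neighbors within the distance threshold to count as a core point; how many random vectors and candidates to keep is a tuning choice that this sketch does not address.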

Abstract

We present sDBSCAN, a scalable density-based clustering algorithm in high dimensions with cosine distance. Utilizing the neighborhood-preserving property of random projections, sDBSCAN can quickly identify core points and their neighborhoods, the primary hurdle of density-based clustering. Theoretically, sDBSCAN outputs a clustering structure similar to DBSCAN under mild conditions with high probability. To further facilitate sDBSCAN, we present sOPTICS, a scalable OPTICS for interactive exploration of the intrinsic clustering structure. We also extend sDBSCAN and sOPTICS to L2, L1, $\chi^2$, and Jensen-Shannon distances via random kernel features. Empirically, sDBSCAN is significantly faster and provides higher accuracy than many other clustering algorithms on real-world million-point data sets. On these data sets, sDBSCAN and sOPTICS run in a few minutes, while their scikit-learn counterparts demand several hours or cannot run due to memory constraints.
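
To make "random kernel features" concrete, here is a minimal sketch of one well-known construction, random Fourier features for the Gaussian kernel, which depends on the L2 distance. The text above does not say which feature maps the authors use, so treating this construction as representative is an assumption, as are the helper name `random_fourier_features` and the values of `d`, `D`, and `sigma`. The point is only that inner products of the randomized features approximate a kernel of the original distance, so a method built on inner products can then work on the features.

```python
import numpy as np

def random_fourier_features(X, W, b):
    """Map rows of X to D random features (hypothetical helper).

    With rows of W drawn from N(0, I / sigma^2) and b from U[0, 2*pi],
    z(x) . z(y) approximates exp(-||x - y||^2 / (2 * sigma^2)).
    """
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
d, D, sigma = 16, 4096, 4.0              # illustrative values only

W = rng.standard_normal((D, d)) / sigma  # random frequencies
b = rng.uniform(0.0, 2.0 * np.pi, size=D)  # random phases

X = rng.standard_normal((5, d))
Z = random_fourier_features(X, W, b)

approx = Z @ Z.T                                           # feature-space inner products
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared L2 distances
exact = np.exp(-sq_dist / (2.0 * sigma ** 2))              # exact Gaussian kernel values

max_err = np.abs(approx - exact).max()
print(f"max absolute kernel approximation error: {max_err:.4f}")
```

The approximation error shrinks as the number of features `D` grows, at the cost of more computation per point.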

Paper Structure

This paper contains 33 sections, 4 theorems, 6 equations, 17 figures, 5 tables, and 6 algorithms.

Key Result

Lemma 1 (CEOs)

For two points $\mathbf x, \mathbf q \in \mathcal{S}^{d-1}$ and a sufficiently large number $D$ of random vectors $\mathbf r_i$, w.l.o.g. we assume that $\mathbf r_1 = \mathop{\mathrm{arg\,max}}\limits_{\mathbf r_i}{|\mathbf q^\top \mathbf r_i|}$. Then, we have

Figures (17)

  • Figure 1: Reachability-plot dendrograms of OPTICS and sOPTICS over L2 and L1 on Mnist. While sOPTICS needs less than 30 seconds, scikit-learn OPTICS requires 1.5 hours on L2 and 0.5 hours on L1.
  • Figure 2: sOPTICS's graphs on L2, cosine, $\chi^2$, and JS on Mnist. Each runs in less than 3 seconds.
  • Figure 3: NMI comparison of DBSCAN variants on cosine, L2, L1, and JS distances on Mnist over the ranges of $\varepsilon$ suggested by the sOPTICS graphs on Mnist. pDBSCAN and scikit-learn have identical results. As the sampling-based DBSCAN implementations only support L2 and cosine distances, we report our implemented sngDBSCAN on L1 and JS.
  • Figure 4: sOPTICS's graphs on L1, L2 and cosine distances on Pamap2. Each runs in less than 2 minutes.
  • Figure 5: sOPTICS's graphs on L2, cosine, $\chi^2$, and JS distances on Mnist8m. Each runs in less than 11 minutes.
  • ...and 12 more figures

Theorems & Definitions (4)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4