Scalable Density-based Clustering with Random Projections

Haochuan Xu; Ninh Pham

TL;DR

The paper tackles the scalability challenge of density-based clustering in high dimensions by introducing sDBSCAN, which uses neighborhood-preserving random projections to rapidly identify core points and their neighborhoods, and sOPTICS, which supports interactive exploration of the clustering structure. The authors provide theoretical guarantees showing that, for cosine distance, sDBSCAN yields a clustering structure close to DBSCAN's under mild conditions, and they extend both methods to $L_2$, $L_1$, $\chi^2$, and Jensen-Shannon distances via random kernel features. Experiments on million-point datasets (e.g., Mnist, Mnist8m, Pamap2) show orders-of-magnitude speedups over the scikit-learn equivalents while maintaining competitive clustering quality. The approach is multi-thread friendly and builds on random-projection-based ANNS, enabling fast, scalable density-based clustering across diverse distance measures.

Abstract

We present sDBSCAN, a scalable density-based clustering algorithm in high dimensions with cosine distance. Utilizing the neighborhood-preserving property of random projections, sDBSCAN can quickly identify core points and their neighborhoods, the primary hurdle of density-based clustering. Theoretically, sDBSCAN outputs a clustering structure similar to DBSCAN under mild conditions with high probability. To further facilitate sDBSCAN, we present sOPTICS, a scalable OPTICS for interactive exploration of the intrinsic clustering structure. We also extend sDBSCAN and sOPTICS to L2, L1, $\chi^2$, and Jensen-Shannon distances via random kernel features. Empirically, sDBSCAN is significantly faster and provides higher accuracy than many other clustering algorithms on real-world million-point data sets. On these data sets, sDBSCAN and sOPTICS run in a few minutes, while their scikit-learn counterparts demand several hours or cannot run due to memory constraints.
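To make the core-point step concrete, here is a minimal NumPy sketch, not the authors' implementation. It compares a brute-force core-point test under cosine distance with a random-projection candidate filter. The function names and the parameters `n_proj`, `top_k`, and `top_m` are hypothetical, and the candidate rule (each point inspects only the points ranked highest on the random vectors it projects onto most strongly) is an assumption about how such a filter could look.

```python
import numpy as np

def normalize(X):
    # Put every row on the unit sphere so that cosine distance = 1 - dot product.
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def exact_core_points(X, eps, min_pts):
    # Brute force: all pairwise cosine distances, O(n^2) time and memory.
    X = normalize(X)
    dist = 1.0 - X @ X.T
    return (dist <= eps).sum(axis=1) >= min_pts  # each point counts itself

def projected_core_points(X, eps, min_pts, n_proj=128, top_k=5, top_m=40, seed=0):
    # Assumed candidate rule (illustrative only): for each point q, look only at
    # the top_m points of the top_k random vectors with the largest projection of q.
    rng = np.random.default_rng(seed)
    X = normalize(X)
    n, d = X.shape
    R = rng.standard_normal((d, n_proj))              # Gaussian random vectors
    P = X @ R                                         # projections, shape (n, n_proj)
    top_points = np.argsort(-P, axis=0)[:top_m].T     # (n_proj, top_m)
    top_vectors = np.argsort(-P, axis=1)[:, :top_k]   # (n, top_k)
    is_core = np.zeros(n, dtype=bool)
    for q in range(n):
        cand = np.unique(top_points[top_vectors[q]].ravel())
        cand = np.union1d(cand, [q])                  # always include q itself
        dist = 1.0 - X[cand] @ X[q]                   # exact distances, candidates only
        is_core[q] = (dist <= eps).sum() >= min_pts
    return is_core
```

Because the filter only ever counts true neighbors, any point it marks as core is also core under the brute-force test; it can miss core points when the candidate sets are too small, which is the trade-off that larger `top_k` and `top_m` would reduce in this sketch.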

Paper Structure (33 sections, 4 theorems, 6 equations, 17 figures, 5 tables, 6 algorithms)

Introduction
Preliminary
DBSCAN
Challenges of DBSCAN in High Dimensions
OPTICS
Random projection-based ANNS
Random projection-based methods
Preprocessing and finding core points
sDBSCAN and sOPTICS
Theoretical analysis
Identify core points
Ensure sDBSCAN's quality
Discuss sOPTICS's quality
Extend to other distance measures
From theory to practice
...and 18 more sections

Key Result

lemma 1

(CEOs) For two points $\mathbf x, \mathbf q \in \mathcal{S}^{d-1}$ and a sufficiently large number $D$ of random vectors $\mathbf r_i$, w.l.o.g. we assume that $\mathbf r_1 = \mathop{\mathrm{arg\,max}}\limits_{\mathbf r_i}{|\mathbf q^\top \mathbf r_i|}$. Then, we have

Figures (17)

Figure 1: Reachability-plot dendrograms of OPTICS and sOPTICS over L2 and L1 on Mnist. While sOPTICS needs less than 30 seconds, scikit-learn OPTICS requires 1.5 hours on L2 and 0.5 hours on L1.
Figure 2: sOPTICS's graphs on L2, cosine, $\chi^2$, and JS on Mnist. Each runs in less than 3 seconds.
Figure 3: NMI comparison of DBSCAN variants on cosine, L2, L1, JS distances on Mnist over suggested ranges of $\varepsilon$ by sOPTICS in Figure \ref{fig:sOptics-Mnist}. pDBSCAN and scikit-learn have identical results. As the sampling-based DBSCAN implementations only support L2 and cosine distances, we report our implemented sngDBSCAN on L1 and JS.
Figure 4: sOPTICS's graphs on L1, L2 and cosine distances on Pamap2. Each runs in less than 2 minutes.
Figure 5: sOPTICS's graphs on L2, cosine, $\chi^2$, and JS distances on Mnist8m. Each runs in less than 11 minutes.
...and 12 more figures

Theorems & Definitions (4)

lemma 1
lemma 2
lemma 3
lemma 4
