Enabling DBSCAN for Very Large-Scale High-Dimensional Spaces
Yongyu Wang
TL;DR
DBSCAN's neighborhood search requires $O(n^2 \beta)$ time with $\beta=O(D)$, which becomes a scalability bottleneck in very large-scale high-dimensional spaces. The paper introduces a spectrum-preserving data compression method that builds a $k$-NN graph, represents the data through a spectral embedding of the graph Laplacian, and defines a spectral similarity $s_{uv}$ used to merge spectrally similar points into pseudo-samples. DBSCAN is then run on the much smaller set of pseudo-samples, and the resulting labels are propagated back to the original points. Experiments on Pendigits, USPS, and MNIST show that clustering accuracy remains high as compression increases, while runtime and memory usage are substantially reduced. This approach enables DBSCAN to handle large-scale high-dimensional datasets and is amenable to resource-constrained hardware such as FPGAs and handheld devices.
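The compress-then-propagate idea summarized above can be illustrated with a minimal sketch. It does not implement the spectral similarity $s_{uv}$; the grouping of points (`group_of`) is taken as a given, hypothetical input standing in for the output of the spectral step. Averaging each group into its pseudo-sample and using `-1` for noise are also assumptions made for illustration, not details confirmed by the summary.

```python
from collections import defaultdict

def compress(points, group_of):
    """Merge each group of points into one pseudo-sample.

    Assumption: the pseudo-sample is the coordinate-wise mean of its members.
    """
    buckets = defaultdict(list)
    for p, g in zip(points, group_of):
        buckets[g].append(p)
    return {g: [sum(c) / len(members) for c in zip(*members)]
            for g, members in buckets.items()}

def propagate(group_of, pseudo_label):
    """Give every original point the label of the pseudo-sample it was merged into."""
    return [pseudo_label[g] for g in group_of]

points = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.2], [9.0, 0.0]]
group_of = [0, 0, 1, 1, 2]            # hypothetical result of the spectral grouping
pseudo = compress(points, group_of)   # 5 points become 3 pseudo-samples
pseudo_label = {0: 0, 1: 1, 2: -1}    # hypothetical DBSCAN labels on pseudo-samples
labels = propagate(group_of, pseudo_label)
```

In this sketch DBSCAN would only ever see the three pseudo-samples, so its quadratic neighborhood search scales with the compressed size rather than with $n$.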
Abstract
DBSCAN is one of the most important non-parametric unsupervised data analysis tools. By applying DBSCAN to a dataset, two key analytical results can be obtained: (1) clustering data points based on density distribution and (2) identifying outliers in the dataset. However, the time complexity of the DBSCAN algorithm is $O(n^2 \beta)$, where $n$ is the number of data points and $\beta = O(D)$, with $D$ representing the dimensionality of the data space. As a result, DBSCAN becomes computationally infeasible when both $n$ and $D$ are large. In this paper, we propose a DBSCAN method based on spectral data compression, capable of efficiently processing datasets with a large number of data points ($n$) and high dimensionality ($D$). By preserving only the most critical structural information during the compression process, our method effectively removes substantial redundancy and noise. Consequently, the solution quality of DBSCAN is significantly improved, enabling more accurate and reliable results.
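To make the $O(n^2 \beta)$ cost concrete, the following is a minimal brute-force DBSCAN sketch in plain Python. It is a textbook-style illustration, not the implementation evaluated in the paper; the names `region_query`, `eps`, and `min_pts` are chosen here for illustration. Each point triggers one neighborhood query, each query scans all $n$ points, and each distance costs $O(D)$.

```python
import math

def region_query(points, i, eps):
    """Brute-force neighborhood search: n distance evaluations, each O(D)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, with -1 marking noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: reachable but not core
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nbrs = region_query(points, j, eps)
            if len(nbrs) >= min_pts:   # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The example yields both outputs named in the abstract: two density-based clusters and one outlier. Because `region_query` runs once per point, the total work is $n$ scans of $n$ points, which is the quadratic term that compression is meant to shrink.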
