Towards Metric DBSCAN: Exact, Approximate, and Streaming Algorithms

Guanlin Mo, Shihong Song, Hu Ding

TL;DR

The paper extends DBSCAN to abstract metric spaces by exploiting low intrinsic (doubling) dimension among inliers, while allowing outliers to be arbitrary. It introduces a radius-guided Gonzalez's algorithm to obtain a basis for efficient exact DBSCAN, develops a linear-time $\rho$-approximate variant, and provides a streaming variant with memory independent of the input size. The exact algorithm uses a pre-processing step based on radius-guided $k$-center clustering and cover trees to label core points and merge clusters with bichromatic closest-pair queries, yielding near-linear time under key assumptions. Experiments across Euclidean and non-Euclidean data demonstrate substantial speedups over established DBSCAN variants, with competitive clustering quality and practical streaming capabilities, highlighting strong potential for large-scale, high-dimensional or non-Euclidean clustering tasks.
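The radius-guided variant of Gonzalez's algorithm mentioned above can be illustrated with a minimal sketch. It assumes the stopping rule is "every point lies within a given radius of some chosen center" instead of a fixed number of centers $k$; the function name `radius_guided_gonzalez`, its parameters, and the choice of the first point as the initial center are hypothetical, and the paper's actual procedure may differ in its details.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def radius_guided_gonzalez(
    points: Sequence[T],
    dist: Callable[[T, T], float],
    radius: float,
) -> Tuple[List[int], List[int]]:
    """Farthest-point traversal that stops once all points are covered.

    Returns (centers, assign): indices of the chosen centers, and for
    each point the index of its nearest chosen center.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    if not points:
        return [], []

    centers = [0]  # assumption: start from the first point
    nearest = [dist(points[0], p) for p in points]
    assign = [0] * len(points)

    while True:
        # The farthest point from the current centers is the next candidate.
        far = max(range(len(points)), key=nearest.__getitem__)
        if nearest[far] <= radius:
            break  # every point is within `radius` of some center
        centers.append(far)
        # Reassign points that are closer to the new center.
        for i, p in enumerate(points):
            d = dist(points[far], p)
            if d < nearest[i]:
                nearest[i] = d
                assign[i] = far

    return centers, assign


# Usage on a toy 1-D metric; any metric (e.g., edit distance) can be passed as `dist`.
centers, assign = radius_guided_gonzalez(
    [0.0, 0.1, 0.2, 5.0, 5.1, 9.0], lambda a, b: abs(a - b), 0.5
)
```

Because the sketch only calls `dist`, it works in any metric space. Each round costs one pass over the data, so the total cost is proportional to the number of points times the number of centers selected.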

Abstract

DBSCAN is a popular density-based clustering algorithm with many practical applications. However, the running time of DBSCAN in high-dimensional space or general metric space (e.g., clustering a set of texts by using edit distance) can be as large as quadratic in the input size. Moreover, most existing acceleration techniques for DBSCAN are only available for low-dimensional Euclidean space. In this paper, we study the DBSCAN problem under the assumption that the inliers (the core points and border points) have a low intrinsic dimension (which is a realistic assumption for many high-dimensional applications), while the outliers can be located anywhere in the space without any assumption. First, we propose a $k$-center clustering based algorithm that can reduce the time-consuming labeling and merging tasks of DBSCAN to linear time. Further, we propose a linear-time approximate DBSCAN algorithm, where the key idea is building a novel small-size summary for the core points. Also, our algorithm can be efficiently implemented for streaming data, and the required memory is independent of the input size. Finally, we conduct experiments and compare our algorithms with several popular DBSCAN algorithms. The experimental results suggest that our proposed approach can significantly reduce the computational complexity in practice.

Paper Structure (22 sections, 17 theorems, 6 equations, 6 figures, 4 tables, 3 algorithms)

Key Result

Proposition 1

Suppose the doubling dimension of a metric space $(X, \mathtt{dis})$ is $D$. For any point set $Y \subseteq X$, we have $|Y|\le 2^{D\lceil \log \alpha \rceil}$, where $\alpha$ is the aspect ratio of $Y$, i.e., $\alpha = \frac{\max_{y,y'\in Y}\mathtt{dis}(y,y')}{\min_{y,y'\in Y,\, y\neq y'}\mathtt{dis}(y,y')}$.
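To make the quantities in Proposition 1 concrete, the sketch below evaluates both sides of the bound on a toy point set. It is only an illustration under stated assumptions: the logarithm is taken to be base 2, the points are distinct, the metric is the absolute difference on the real line, and $D = 1$ is used for that space. The helper names `aspect_ratio` and `size_bound` are hypothetical.

```python
import math
from itertools import combinations


def aspect_ratio(points, dist):
    """Largest pairwise distance divided by the smallest one (distinct points)."""
    pair_dists = [dist(a, b) for a, b in combinations(points, 2)]
    return max(pair_dists) / min(pair_dists)


def size_bound(doubling_dim, alpha):
    """Right-hand side of the bound: 2^(D * ceil(log2(alpha))), base 2 assumed."""
    return 2 ** (doubling_dim * math.ceil(math.log2(alpha)))


# Four equally spaced points on the real line.
Y = [0.0, 1.0, 2.0, 3.0]
alpha = aspect_ratio(Y, lambda a, b: abs(a - b))  # 3.0 / 1.0
bound = size_bound(1, alpha)                      # 2 ** (1 * 2)
```

Here $\alpha = 3$ and $\lceil \log_2 3 \rceil = 2$, so the right-hand side is $2^{2} = 4$, which matches $|Y| = 4$ in this instance.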

Figures (6)

  • Figure 1: An example for DBSCAN with $MinPts=4$: the solid points are core points, the point $p_1$ is a border point, and the point $p_2$ is an outlier.
  • Figure 2: The sets $\tilde{C}_{e_1}$ (yellow points), $\tilde{C}_{e_2}$ (blue points), and $\tilde{C}_{e_3}$ (green points) are shown in the figure. $\tilde{C}_{e_1}$ and $\tilde{C}_{e_2}$ should be merged into the same cluster, because their closest-pair distance is less than $\epsilon$; on the other hand, $\tilde{C}_{e_3}$ should be merged into a different cluster.
  • Figure 3: Running time with varying $\epsilon$. Some baseline algorithms are not tested in some figures, because either they run too slowly ($>10^6$s) on the high-dimensional data, or they cannot run on the non-Euclidean data.
  • Figure 4: The ARI and AMI with fixed $\epsilon$ and different $\rho$.
  • Figure 5: Clustering results of exact DBSCAN, our approximate algorithm with $\rho=0.5$, and DP-means. Points with the same color belong to the same cluster, and the red points are outliers.
  • ...and 1 more figures

Theorems & Definitions (29)

  • Definition 1: DBSCAN [ester1996density]
  • Definition 2: $\rho$-approximate DBSCAN [gan2015dbscan]
  • Remark 1
  • Definition 3: Doubling dimension [gupta2003bounded]
  • Proposition 1 [talwar2004bypassing, krauthgamer2004navigating]
  • Definition 4: $r$-net
  • Claim 1: complexities of cover tree
  • Remark 2
  • Lemma 1
  • Remark 3
  • ...and 19 more