Dynamic data summarization for hierarchical spatial clustering

Kayumov Abduaziz, Min Sik Kim, Ji Sun Shin

TL;DR

This work tackles dynamic hierarchical spatial clustering. It analyzes the hardness of maintaining HDBSCAN's MST under point insertions and deletions and proposes an online-offline framework that combines Bubble-tree online summarization with offline data-bubble-based clustering. The exact dynamic approach is found to be impractical for modern workloads, which motivates the Bubble-tree data summarization method: it preserves clustering quality while enabling fast updates. Across synthetic and real-world datasets, the Bubble-tree approach achieves clustering quality comparable to static methods with substantial speedups, and it outperforms streaming and fixed-bubble alternatives in fully dynamic scenarios. The framework supports scalable analysis of dynamic spatial data.

Abstract

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) finds meaningful patterns in spatial data by considering density and spatial proximity. Because the clustering algorithm is inherently designed for static applications, recent studies have likewise focused on accelerating it in static settings using approximate or parallel methods. However, much less attention has been given to dynamic environments, where even a single point insertion or deletion can require recomputing the clustering hierarchy from scratch, because the minimum spanning tree (MST) must be maintained over a complete graph. This paper addresses the challenge of enhancing the clustering algorithm for dynamic data. We present an exact algorithm that maintains density information and updates the clustering hierarchy of HDBSCAN during point insertions and deletions. Considering the hardness of adapting the exact algorithm to dynamic data involving modern workloads, we propose an online-offline framework. The online component efficiently summarizes dynamic data using a tree structure, called Bubble-tree, while the offline step performs the static clustering. Experimental results demonstrate that the data summarization adapts well to fully dynamic environments, providing compression quality on par with existing techniques while significantly improving the runtime performance of the clustering algorithm in dynamic data workloads.

Paper Structure

This paper contains 23 sections, 2 theorems, 10 equations, 7 figures, 1 table, 6 algorithms.

Key Result

Lemma 1

Given an MST $T$ of a mutual reachability graph $G$, the insertion of point $p$ into $G$ requires $\Omega (n \log n)$ time to compute the updated MST $T'$ exactly.

Figures (7)

  • Figure 1: An illustration of HDBSCAN clustering results performed on 2D example data for $minPts = 3$. The neighborhood information (a) includes core distance values of data points computed using their nearest neighbors. The distance matrix (b) shows all pairwise mutual reachability distances, used as weights to compute the MST (c). The clustering hierarchy, called dendrogram (d), is obtained from the MST by removing the edges in decreasing order of weights, resulting in two major clusters $A$ and $E$.
  • Figure 2: An illustration of HDBSCAN clustering results for $minPts = 3$ on the same 2D example data as in Figure 1, updated with the insertion of point $p$. The neighborhood information (a) highlights the updated core distance values of points $b$ and $e$. The distance matrix (b) shows the updated mutual reachability distances, which are reflected in the MST (c). The clustering hierarchy (d) shows the emergence of a single cluster.
  • Figure 3: Feasibility analysis of the exact dynamic algorithm on the Gaussian Mixtures dataset of 100K points for $minPts = 10$.
  • Figure 4: Data summarization performed on the 2D example dataset illustrates the differences between ClusTree (a--c) and Bubble-tree (e--g) in incremental settings: original data points (shown in light gray) are inserted incrementally. The leaf nodes of the tree structures are shown with plus signs: ClusTree (+) and Bubble-tree (+); the larger the plus sign, the more data points the leaf node absorbs. For the first 200 insertions (a and e), both tree structures compress the data well; however, subsequent insertions cause ClusTree to create overfilled leaf nodes that have no chance of being split other than by inserting more points. The resulting leaf nodes from both tree structures were used as compressed data to compute the HDBSCAN clustering results shown in (d) and (h), for ClusTree and Bubble-tree respectively, illustrating how well the tree structures summarize the dynamic data to be used with the clustering algorithm (best viewed in color).
  • Figure 5: Running time comparison of data summarization techniques. ClusTree is configured with a maximum tree height of 10, roughly equivalent to the 1% compression rate used in both the Bubble-tree and Incremental approaches.
  • ...and 2 more figures

Theorems & Definitions (9)

  • Definition 1: Core distance
  • Definition 2: Mutual reachability distance
  • Definition 3: Mutual reachability graph
  • Definition 4: Clustering features, $CF$
  • Definition 5: Data bubble, $B$
  • Lemma 1
  • Proof 1
  • Lemma 2
  • Proof 2
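
Definitions 4 and 5 name clustering features and data bubbles, the summaries that the online component maintains. The sketch below assumes the commonly used formulation: a clustering feature is the triple $(n, LS, SS)$, and a data bubble derived from it has a representative $LS/n$, an extent equal to the root mean squared pairwise distance, and an estimated $k$-nearest-neighbor distance $(k/n)^{1/d} \cdot extent$. The paper's own definitions may differ in detail, and the class and function names here are hypothetical.

```python
import math
from dataclasses import dataclass


class ClusteringFeature:
    """Additive summary (n, LS, SS) of a set of d-dimensional points."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim   # linear sum per dimension
        self.ss = 0.0           # sum of squared norms

    def insert(self, p):
        self.n += 1
        self.ls = [a + x for a, x in zip(self.ls, p)]
        self.ss += sum(x * x for x in p)

    def delete(self, p):
        # Subtraction undoes an earlier insert, so deletions are supported.
        self.n -= 1
        self.ls = [a - x for a, x in zip(self.ls, p)]
        self.ss -= sum(x * x for x in p)


@dataclass
class DataBubble:
    rep: list       # representative point (centroid)
    n: int          # number of summarized points
    extent: float   # RMS pairwise distance inside the bubble
    dim: int

    def nn_dist(self, k):
        """Estimated k-NN distance, assuming points are spread uniformly."""
        return (k / self.n) ** (1.0 / self.dim) * self.extent


def to_bubble(cf):
    """Derive a data bubble from a clustering feature (assumed formulas)."""
    dim = len(cf.ls)
    rep = [a / cf.n for a in cf.ls]
    if cf.n < 2:
        extent = 0.0
    else:
        ls_sq = sum(a * a for a in cf.ls)
        # Clamp at zero to absorb floating-point round-off.
        spread = max(0.0, 2 * cf.n * cf.ss - 2 * ls_sq)
        extent = math.sqrt(spread / (cf.n * (cf.n - 1)))
    return DataBubble(rep=rep, n=cf.n, extent=extent, dim=dim)


# Hypothetical example: the four corners of a square with side 2.
pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
cf = ClusteringFeature(dim=2)
for p in pts:
    cf.insert(p)
bubble = to_bubble(cf)
print(bubble.rep, bubble.extent, bubble.nn_dist(1))

# Deleting a point gives the same summary as never having inserted it.
cf.delete(pts[3])
fresh = ClusteringFeature(dim=2)
for p in pts[:3]:
    fresh.insert(p)
print((cf.n, cf.ls, cf.ss) == (fresh.n, fresh.ls, fresh.ss))
```

Because all three components are plain sums, both insertion and deletion are constant-time updates per dimension, which is the property that makes such summaries attractive for fully dynamic data.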