Dynamic data summarization for hierarchical spatial clustering
Kayumov Abduaziz, Min Sik Kim, Ji Sun Shin
TL;DR
This work tackles dynamic hierarchical spatial clustering. It analyzes the hardness of maintaining HDBSCAN's minimum spanning tree (MST) under point insertions and deletions, and proposes an online-offline framework that combines online summarization using a Bubble-tree with offline clustering based on data bubbles. The exact dynamic approach is found to be impractical for modern workloads, which motivates the Bubble-tree data summarization method: it preserves clustering quality while enabling fast updates. Across synthetic and real-world datasets, the Bubble-tree approach achieves clustering quality comparable to static methods with substantial speedups, and it outperforms streaming and fixed-bubble alternatives in fully dynamic scenarios. The framework supports scalable analysis of dynamic spatial data and opens avenues for integration with production-grade clustering pipelines.
Abstract
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) finds meaningful patterns in spatial data by considering density and spatial proximity. The algorithm is inherently designed for static applications, and recent studies have likewise focused on accelerating it in static settings using approximate or parallel methods. Much less attention has been given to dynamic environments, where even a single point insertion or deletion can require recomputing the clustering hierarchy from scratch, because the minimum spanning tree (MST) over a complete graph must be maintained. This paper addresses the challenge of adapting the clustering algorithm to dynamic data. We present an exact algorithm that maintains density information and updates the clustering hierarchy of HDBSCAN during point insertions and deletions. Because the exact algorithm is hard to adapt to dynamic data under modern workloads, we propose an online-offline framework. The online component efficiently summarizes dynamic data using a tree structure called Bubble-tree, while the offline step performs the static clustering. Experimental results demonstrate that the data summarization adapts well to fully dynamic environments, providing compression quality on par with existing techniques while significantly improving the runtime performance of the clustering algorithm on dynamic data workloads.
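To illustrate why a single update is costly in the static setting, the following minimal sketch recomputes the MST from scratch after one insertion. It is not the paper's exact dynamic algorithm and not the Bubble-tree. It assumes the standard HDBSCAN definitions: the core distance of a point is the distance to its min_pts-th nearest neighbor (counting the point itself, one common convention), and the edge weight between two points is the mutual reachability distance max(core(a), core(b), d(a, b)). The function names and the toy points are hypothetical, chosen only for this illustration.

```python
import math

def core_distances(points, min_pts):
    """Core distance per point: distance to its min_pts-th nearest
    neighbor, counting the point itself (an assumed convention)."""
    k = min(min_pts, len(points))
    return [sorted(math.dist(p, q) for q in points)[k - 1] for p in points]

def mutual_reachability_mst(points, min_pts):
    """Prim's algorithm over the complete mutual-reachability graph.
    Runs in O(n^2) time; returns a list of (u, v, weight) edges."""
    n = len(points)
    core = core_distances(points, min_pts)

    def mreach(i, j):
        return max(core[i], core[j], math.dist(points[i], points[j]))

    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest vertex not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # Relax edges from u to all remaining vertices.
        for v in range(n):
            if not in_tree[v]:
                w = mreach(u, v)
                if w < best[v]:
                    best[v], parent[v] = w, u
    return edges

# Toy data (hypothetical): three nearby points and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
before_core = core_distances(points, min_pts=3)
before_mst = mutual_reachability_mst(points, min_pts=3)

# One insertion near the distant point changes that point's core distance,
# so weights of edges between pre-existing points can change as well.
updated = points + [(9.0, 0.0)]
after_core = core_distances(updated, min_pts=3)
after_mst = mutual_reachability_mst(updated, min_pts=3)
```

In this toy example the inserted point lowers the core distance of the distant point, which in turn lowers the mutual reachability distance between two points that were already present. Because an insertion can alter edge weights across the complete graph in this way, the naive approach rebuilds the whole tree on every update.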
